
Resilience, Security & Failure Handling

Split from Donna Project Spec v3.0 — Sections 3.6, 16, 17, 18

API Resilience Layer

Every Claude API call goes through a resilience wrapper.

Retry Policies

| Task Category | Max Retries | Backoff | On Failure |
| --- | --- | --- | --- |
| Critical (digest, deadline reminders) | 3 | Exponential, 2s start, 30s cap | Fall back to degraded mode |
| Standard (parse, classify) | 2 | Exponential, 1s start, 15s cap | Queue for retry next cycle; notify user of delay |
| Agent work (research, code gen) | 1 | 5s fixed | Mark agent_status = failed; notify user; do not retry (budget protection) |
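
The retry policies above could be encoded roughly as follows. This is a minimal sketch, not the actual wrapper: the names RetryPolicy, POLICIES, and call_with_retry are hypothetical, and "exponential" is assumed to mean doubling the delay on each retry, which the spec does not state.

```python
import time
from dataclasses import dataclass
from typing import Callable, TypeVar

T = TypeVar("T")


@dataclass(frozen=True)
class RetryPolicy:
    max_retries: int
    base_delay: float   # seconds before the first retry
    max_delay: float    # cap on any single delay
    exponential: bool = True

    def delay(self, retry_index: int) -> float:
        """Delay before retry number `retry_index` (0-based)."""
        if not self.exponential:
            return self.base_delay
        # Assumption: doubling per retry, clamped at the cap.
        return min(self.base_delay * 2 ** retry_index, self.max_delay)


# Values taken from the Retry Policies table; the keys are hypothetical.
POLICIES = {
    "critical": RetryPolicy(3, 2.0, 30.0),
    "standard": RetryPolicy(2, 1.0, 15.0),
    "agent": RetryPolicy(1, 5.0, 5.0, exponential=False),
}


def call_with_retry(
    fn: Callable[[], T],
    category: str,
    on_failure: Callable[[], T],
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run fn, retrying per the category's policy, then fall back."""
    policy = POLICIES[category]
    for attempt in range(policy.max_retries + 1):
        try:
            return fn()
        except Exception:
            if attempt == policy.max_retries:
                break  # retry budget exhausted
            sleep(policy.delay(attempt))
    return on_failure()
```

The on_failure callback is where each category's "On Failure" action would plug in, for example switching to degraded mode for critical tasks. Passing sleep as a parameter keeps the wrapper testable without real delays.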

Degraded Mode Definitions

  • Morning digest: Template-based using raw calendar + task list from SQLite. No LLM, no persona. "Today's schedule: [list]. Tasks due: [list]. Note: AI digest unavailable."
  • Reminders: Static template: "Reminder: [task title] starts at [time]." Personality lost — acceptable.
  • Task parsing: Accept raw text as-is with the following field defaults. Flagged needs_reparse: true for re-processing on recovery. Never lose a capture.
| Field | Default Value |
| --- | --- |
| title | Raw input text (truncated to 200 chars) |
| description | Full raw input text |
| priority | 3 (middle of scale) |
| deadline | null (no assumed urgency) |
| domain | personal (safest default) |
| deadline_type | none |
| estimated_duration | null |
| agent_eligible | false |
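
As an illustration of the degraded task-parsing path, the field defaults can be applied in one small function. The function name fallback_task is hypothetical, and representing the task as a plain dict is an assumption; the field names and values come from the defaults above.

```python
from typing import Any, Dict


def fallback_task(raw_text: str) -> Dict[str, Any]:
    """Build a task record from raw input when the LLM parser is unavailable."""
    return {
        "title": raw_text[:200],        # truncated to 200 chars
        "description": raw_text,        # full raw input, nothing lost
        "priority": 3,                  # middle of scale
        "deadline": None,               # no assumed urgency
        "domain": "personal",           # safest default
        "deadline_type": "none",
        "estimated_duration": None,
        "agent_eligible": False,
        "needs_reparse": True,          # re-process once the API recovers
    }
```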

Circuit Breaker

5 consecutive failures in 10 minutes → circuit-breaker mode:

  1. Pause all non-critical agent work
  2. Switch critical paths to degraded mode
  3. Send single SMS: "Donna's AI is temporarily unavailable. Running in basic mode. Will notify when restored."
  4. Test recovery every 5 minutes with lightweight health-check
  5. Reset on first successful response
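
The trip, probe, and reset rules above can be sketched as a small state machine. This is an illustrative sketch only: the class and method names are hypothetical, and pausing agent work or sending the SMS is left to the caller, which would act on the return values.

```python
import time
from collections import deque
from typing import Callable, Deque, Optional


class CircuitBreaker:
    """Trips after `threshold` consecutive failures inside `window` seconds."""

    def __init__(
        self,
        threshold: int = 5,
        window: float = 600.0,          # 10 minutes
        probe_interval: float = 300.0,  # 5 minutes between recovery probes
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.threshold = threshold
        self.window = window
        self.probe_interval = probe_interval
        self.clock = clock
        self.failures: Deque[float] = deque()
        self.opened_at: Optional[float] = None
        self.last_probe: float = 0.0

    @property
    def is_open(self) -> bool:
        return self.opened_at is not None

    def record_failure(self) -> bool:
        """Return True exactly when this failure trips the breaker."""
        now = self.clock()
        self.failures.append(now)
        # Drop failures that fell out of the 10-minute window.
        while self.failures and now - self.failures[0] > self.window:
            self.failures.popleft()
        if not self.is_open and len(self.failures) >= self.threshold:
            self.opened_at = now
            self.last_probe = now
            return True  # caller pauses agents and sends the single SMS
        return False

    def record_success(self) -> bool:
        """Reset on any success; return True if the breaker was open."""
        was_open = self.is_open
        self.failures.clear()  # a success breaks the consecutive run
        self.opened_at = None
        return was_open  # caller sends the "restored" notification

    def should_probe(self) -> bool:
        """While open, allow one lightweight health-check per interval."""
        if not self.is_open:
            return False
        now = self.clock()
        if now - self.last_probe >= self.probe_interval:
            self.last_probe = now
            return True
        return False
```

Returning True only on the transition into the open state is one way to guarantee the "single SMS" behaviour, since later failures while open do not signal again.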

Response Validation

Every API response validated against expected output schema before use. Malformed JSON, missing fields, or schema mismatches → retry (counted against retry budget).
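
A minimal sketch of the validation step, using only the standard library. The schema format (field name mapped to expected Python type), the PARSE_SCHEMA example, and the InvalidResponse exception are all assumptions made for illustration; the real schemas are not defined in this section.

```python
import json
from typing import Any, Dict


class InvalidResponse(Exception):
    """Raised when a response fails validation; the caller counts it as a retry."""


# Hypothetical schema for a parse response: field name -> expected type.
PARSE_SCHEMA: Dict[str, type] = {"title": str, "priority": int, "domain": str}


def validate_response(raw: str, schema: Dict[str, type]) -> Dict[str, Any]:
    """Parse raw JSON and check required fields and their types."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise InvalidResponse(f"malformed JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise InvalidResponse("expected a JSON object")
    for field, expected_type in schema.items():
        if field not in data:
            raise InvalidResponse(f"missing field: {field}")
        if not isinstance(data[field], expected_type):
            raise InvalidResponse(f"wrong type for field: {field}")
    return data
```

Raising an exception keeps validation failures on the same path as transport failures, so a retry wrapper can count both against the same budget.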

Health Monitoring

Layer 1: Docker Healthchecks

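Each service gets a healthcheck directive. Orchestrator exposes HTTP /health endpoint. Docker polls every 30s. Three consecutive failures → container marked unhealthy. Note: restart: unless-stopped acts only when the container process exits; standalone Docker does not restart a container merely for being unhealthy, so the restart must be triggered by the process exiting or by an external supervisor.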

/health checks: SQLite reachable, Discord bot connected, scheduler loop running, last API health-check < 10 min. Returns 200 or 503 with JSON listing failures.
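
The aggregation behind /health might look like the sketch below. It is framework-agnostic and returns a status code and JSON body; the function name, the check names, and the JSON shape are assumptions, since the spec only says the body lists failures.

```python
import json
from typing import Callable, Dict, List, Tuple


def health_response(checks: Dict[str, Callable[[], bool]]) -> Tuple[int, str]:
    """Run every check; return (200, body) if all pass, else (503, body)."""
    failures: List[str] = []
    for name, check in checks.items():
        try:
            ok = check()
        except Exception:
            ok = False  # a check that raises counts as a failure
        if not ok:
            failures.append(name)
    status = 200 if not failures else 503
    body = {"status": "ok" if not failures else "unhealthy", "failures": failures}
    return status, json.dumps(body)
```

A caller would pass one callable per item in the list above, for example a function that runs a trivial query against SQLite and one that compares the last API health-check timestamp against the 10-minute limit.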

Layer 2: External Watchdog

Separate lightweight process outside Docker. Every 5 min checks docker inspect --format='{{.State.Health.Status}}' donna-orchestrator. If unhealthy/stopped → alert via Twilio SMS or Discord webhook (independent of Donna bot).
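
One watchdog cycle could be sketched as below. The docker inspect invocation is the one named above; the function names and the alert callback are hypothetical, and the real alert would go out via Twilio SMS or a Discord webhook. Treating "starting" as non-alerting is an assumption.

```python
import subprocess
from typing import Callable


def container_health(
    name: str = "donna-orchestrator",
    run: Callable = subprocess.run,
) -> str:
    """Return Docker's health status, or "missing" if inspect fails."""
    result = run(
        ["docker", "inspect", "--format={{.State.Health.Status}}", name],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        return "missing"  # container absent or Docker daemon unreachable
    return result.stdout.strip()


def watchdog_tick(
    alert: Callable[[str], None],
    health: Callable[[], str] = container_health,
) -> bool:
    """One 5-minute cycle: alert unless the container is healthy or starting."""
    status = health()
    if status in ("healthy", "starting"):
        return True
    alert(f"Donna orchestrator is {status}")
    return False
```

A fuller version would likely also read .State.Status, since the health status alone may not distinguish a stopped container from an unhealthy running one.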

Layer 3: Daily Self-Diagnostic

Part of morning digest generation. Before generating: DB integrity (PRAGMA integrity_check), NVMe space, last calendar sync, last Supabase sync, pending migrations, budget status. Issues prepended to morning digest.
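
Two of these checks, database integrity and free disk space, can be sketched with the standard library alone. The function name, the 10 GiB threshold, and the issue message wording are assumptions; the sync, migration, and budget checks are omitted because they depend on components not described here.

```python
import shutil
import sqlite3
from typing import List


def run_diagnostics(
    db_path: str,
    data_dir: str,
    min_free_bytes: int = 10 * 1024**3,  # assumed threshold: 10 GiB
) -> List[str]:
    """Return human-readable issues to prepend to the morning digest."""
    issues: List[str] = []

    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute("PRAGMA integrity_check").fetchall()
    finally:
        conn.close()
    # A healthy database yields exactly one row containing "ok".
    if rows != [("ok",)]:
        issues.append(f"DB integrity check failed for {db_path}: {rows[:3]}")

    free = shutil.disk_usage(data_dir).free
    if free < min_free_bytes:
        issues.append(f"Low disk space on {data_dir}: {free // 1024**2} MiB free")

    return issues
```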

Crash Handler & Error Logging

Unhandled exceptions must never fail silently. The orchestrator registers crash handlers at startup:

  • sys.excepthook: Catches unhandled synchronous exceptions. Logs full traceback to structlog → donna_logs.db → Loki.
  • asyncio exception handler: Catches unhandled async exceptions via loop.set_exception_handler(). Same logging pipeline.
  • On crash: Log the full traceback at CRITICAL level, trigger Layer 2 alerting (external watchdog detects container restart), and ensure the Docker healthcheck transitions to unhealthy.
  • Periodic health checks: The orchestrator's /health endpoint runs checks every 30 seconds (Docker healthcheck interval). On each check: SQLite reachable, Discord bot connected, scheduler loop running, last API health-check response < 10 min. Any failure returns HTTP 503 with JSON listing failures.
  • All errors logged to donna_logs.db with event types from the hierarchical scheme (e.g., system.crash, api.call.failed, agent.timeout). Grafana alert rules trigger on CRITICAL-level log entries.
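
Registering both handlers at startup could look like the sketch below. It uses the standard logging module as a stand-in for the structlog pipeline, which is an assumption made to keep the example self-contained; the logger name and function names are hypothetical.

```python
import asyncio
import logging
import sys

log = logging.getLogger("donna.crash")  # stand-in for the structlog pipeline


def _excepthook(exc_type, exc, tb) -> None:
    """Log unhandled synchronous exceptions with the full traceback."""
    log.critical("system.crash", exc_info=(exc_type, exc, tb))


def _async_handler(loop: asyncio.AbstractEventLoop, context: dict) -> None:
    """Log exceptions the event loop could not deliver to any awaiter."""
    exc = context.get("exception")  # may be absent for some loop errors
    log.critical("system.crash: %s", context.get("message"), exc_info=exc)


def install_crash_handlers(loop: asyncio.AbstractEventLoop) -> None:
    """Call once at orchestrator startup, before scheduling any tasks."""
    sys.excepthook = _excepthook
    loop.set_exception_handler(_async_handler)
```

Logging at CRITICAL is what would let a Grafana alert rule on CRITICAL-level entries fire for both the synchronous and the async path.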

Backup Strategy

Method

SQLite .backup API (connection.backup()). Never file copy — copying WAL-mode SQLite during writes can corrupt.
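
A minimal sketch of the backup call using Python's sqlite3 module, whose Connection.backup() wraps the SQLite online backup API. The function name and path arguments are hypothetical.

```python
import sqlite3


def backup_database(live_path: str, backup_path: str) -> None:
    """Copy a live database to backup_path via the online backup API."""
    src = sqlite3.connect(live_path)
    dst = sqlite3.connect(backup_path)
    try:
        # Produces a consistent snapshot even while other
        # connections keep writing to the live database.
        src.backup(dst)
    finally:
        dst.close()
        src.close()
```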

Schedule & Retention

  • Daily at 3 AM (blackout hours): full backup of donna_tasks.db and donna_logs.db
  • 7 daily, 4 weekly (Sunday), 3 monthly (1st) retained
  • Worst case: ~14 backups × 500MB = 7GB (trivial on 1TB NVMe)
  • Off-server: weekly/monthly pushed to cloud (GCS free 5GB or Backblaze B2 ~$0.04/month)
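
The retention rule can be expressed as a function that selects which backup dates survive pruning. This is a sketch under the assumption that each backup is identified by its calendar date; the function name is hypothetical.

```python
from datetime import date
from typing import Iterable, Set


def backups_to_keep(
    dates: Iterable[date],
    daily: int = 7,
    weekly: int = 4,
    monthly: int = 3,
) -> Set[date]:
    """Keep the newest 7 dailies, 4 Sunday backups, and 3 first-of-month backups."""
    ordered = sorted(set(dates), reverse=True)  # newest first
    keep = set(ordered[:daily])
    keep.update([d for d in ordered if d.weekday() == 6][:weekly])  # 6 = Sunday
    keep.update([d for d in ordered if d.day == 1][:monthly])
    return keep
```

Because the most recent Sunday usually also falls among the seven dailies, the retained set is often smaller than 14, which is consistent with 14 being the worst case.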

Recovery

  • RPO: 24 hours (last daily backup). Supabase replica reduces effective RPO to sync interval.
  • Procedure: Stop orchestrator → copy backup to live path → restart → orchestrator triggers Supabase re-sync. Documented in RECOVERY.md.

Security & Privacy

| Principle | Implementation |
| --- | --- |
| Least privilege | Each agent has only the tools defined in task type registry |
| No credentials in agent context | Agents request tool calls via orchestrator. Never see raw API keys. |
| Sandboxed filesystem | Agents only access /donna/workspace/ |
| Git safety | Feature branches only. Main/production have push protection. |
| Email safety | Draft-only default. Send scope behind feature flag + OAuth re-auth. |
| No data exfiltration | MCP server whitelists allowed outbound destinations |
| Tool validation | Orchestrator validates all model tool call requests before execution |
| Blackout enforcement | 12am–6am hard block on outbound at notification service level |
| Log sanitization | No credentials in logs. API bodies at DEBUG only with redaction. |
| NVMe encryption | LUKS at rest. Decryption via TPM or entered at boot. |
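
Two of these controls, the filesystem sandbox and the blackout window, reduce to small guard functions. The sketch below is illustrative only: the function names are hypothetical, and treating the blackout as the half-open interval from 00:00 up to but not including 06:00 local time is an assumption.

```python
from datetime import time
from pathlib import Path

WORKSPACE = Path("/donna/workspace")
BLACKOUT_START, BLACKOUT_END = time(0, 0), time(6, 0)


def in_workspace(requested: str) -> bool:
    """True only if the path resolves to a location inside the workspace."""
    root = WORKSPACE.resolve()
    # resolve() collapses ".." segments and follows symlinks, so
    # traversal attempts are judged by where they actually land.
    resolved = (WORKSPACE / requested).resolve()
    return resolved == root or root in resolved.parents


def outbound_allowed(now: time) -> bool:
    """Hard block on outbound notifications during blackout hours."""
    return not (BLACKOUT_START <= now < BLACKOUT_END)
```

An absolute path passed to in_workspace replaces the workspace prefix when joined, so a request such as /etc/passwd is rejected rather than being silently re-rooted.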

Acceptable Failures

  • Priority misclassification (user corrects → feeds learning)
  • Duplicate reminders (annoying, no data loss)
  • Agent produces low-quality code (user reviews before merge)
  • Suboptimal scheduling (user reschedules → feeds learning)
  • Local LLM misroutes to Claude API (costs more, completes correctly)

Unacceptable Failures

  • Missing a deadline reminder (must escalate)
  • Sending emails to unintended recipients (architecturally blocked)
  • Deleting files without backup (append/modify only)
  • Overwriting code without version control (always branched/stashed)
  • Exceeding budget without notification (synchronous cost monitoring)
  • Contact during blackout (hard block at notification service)
  • Agent running indefinitely (timeout enforced)
  • Learned preference causing repeated errors (auto-disabled)
  • Silent service failure (detected within 10 min)