Resilience, Security & Failure Handling¶

Split from Donna Project Spec v3.0 — Sections 3.6, 16, 17, 18

API Resilience Layer¶

Every Claude API call goes through a resilience wrapper.

Retry Policies¶

Task Category	Max Retries	Backoff	On Failure
Critical (digest, deadline reminders)	3	Exponential, 2s start, 30s cap	Fall back to degraded mode
Standard (parse, classify)	2	Exponential, 1s start, 15s cap	Queue for retry next cycle; notify user of delay
Agent work (research, code gen)	1	5s fixed	Mark `agent_status = failed`; notify user; no further retries (budget protection)
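The delay schedule implied by the table can be sketched as follows. This is a minimal illustration, not the wrapper itself: the `RetryPolicy` class, the `POLICIES` keys, and the doubling factor are assumptions (the table says only "Exponential"), while the retry counts, start delays, and caps are taken from the rows above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class RetryPolicy:
    """Hypothetical container for one row of the retry table."""
    max_retries: int
    base_delay_s: float
    cap_s: Optional[float]  # None means a fixed delay with no growth


# Numbers come from the table; the category keys are illustrative names.
POLICIES = {
    "critical": RetryPolicy(max_retries=3, base_delay_s=2.0, cap_s=30.0),
    "standard": RetryPolicy(max_retries=2, base_delay_s=1.0, cap_s=15.0),
    "agent": RetryPolicy(max_retries=1, base_delay_s=5.0, cap_s=None),
}


def backoff_delays(policy: RetryPolicy) -> List[float]:
    """Return the wait (in seconds) before each retry attempt."""
    delays = []
    for attempt in range(policy.max_retries):
        if policy.cap_s is None:
            # Fixed backoff: same delay every time.
            delays.append(policy.base_delay_s)
        else:
            # Assumed doubling per attempt, clamped to the cap.
            delays.append(min(policy.base_delay_s * 2 ** attempt, policy.cap_s))
    return delays
```

With these retry counts the caps are never reached; they only matter if the retry limits are raised later.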

Degraded Mode Definitions¶

Morning digest: Template-based using raw calendar + task list from SQLite. No LLM, no persona. "Today's schedule: [list]. Tasks due: [list]. Note: AI digest unavailable."
Reminders: Static template: "Reminder: [task title] starts at [time]." Personality lost — acceptable.
Task parsing: Accept raw text as-is with the following field defaults. The task is flagged needs_reparse: true so it is re-processed on recovery. Never lose a capture.

Field	Default Value
title	Raw input text (truncated to 200 chars)
description	Full raw input text
priority	3 (middle of scale)
deadline	null (no assumed urgency)
domain	personal (safest default)
deadline_type	none
estimated_duration	null
agent_eligible	false
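A degraded-mode parser following these defaults might look like the sketch below. The function name `degraded_parse` and the plain-dict task shape are assumptions; the field names and default values are the ones listed in the table, plus the needs_reparse flag described above.

```python
from typing import Any, Dict


def degraded_parse(raw_text: str) -> Dict[str, Any]:
    """Build a task record without calling the LLM (hypothetical helper)."""
    return {
        "title": raw_text[:200],       # truncated to 200 chars
        "description": raw_text,       # full raw input is always preserved
        "priority": 3,                 # middle of scale
        "deadline": None,              # no assumed urgency
        "domain": "personal",          # safest default
        "deadline_type": "none",
        "estimated_duration": None,
        "agent_eligible": False,
        "needs_reparse": True,         # picked up for re-processing on recovery
    }
```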

Circuit Breaker¶

5 consecutive failures in 10 minutes → circuit-breaker mode:

1. Pause all non-critical agent work
2. Switch critical paths to degraded mode
3. Send single SMS: "Donna's AI is temporarily unavailable. Running in basic mode. Will notify when restored."
4. Test recovery every 5 minutes with lightweight health-check
5. Reset on first successful response
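The trip, probe, and reset logic above can be sketched as a small state holder. This is an illustration under assumptions: the class and method names are hypothetical, and the injectable `clock` parameter exists only to make the sketch testable. The thresholds (5 failures, 10 minutes, 5-minute probes) come from the text.

```python
import time
from collections import deque
from typing import Callable, Deque


class CircuitBreaker:
    """Hypothetical breaker: opens after N consecutive failures inside a window."""

    def __init__(self, threshold: int = 5, window_s: float = 600.0,
                 probe_interval_s: float = 300.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.threshold = threshold
        self.window_s = window_s
        self.probe_interval_s = probe_interval_s
        self._clock = clock
        self._failures: Deque[float] = deque()
        self._last_probe = 0.0
        self.open = False

    def record_failure(self) -> bool:
        """Record a failed call. Returns True only when this failure trips the
        breaker, so the caller can send the single SMS exactly once."""
        now = self._clock()
        self._failures.append(now)
        while len(self._failures) > self.threshold:
            self._failures.popleft()  # keep only the latest N failures
        tripped = (not self.open
                   and len(self._failures) == self.threshold
                   and now - self._failures[0] <= self.window_s)
        if tripped:
            self.open = True
            self._last_probe = now
        return tripped

    def record_success(self) -> None:
        """Any success breaks the consecutive run and closes the breaker."""
        self._failures.clear()
        self.open = False

    def should_probe(self) -> bool:
        """True when the breaker is open and a health-check probe is due."""
        now = self._clock()
        if self.open and now - self._last_probe >= self.probe_interval_s:
            self._last_probe = now
            return True
        return False
```

The caller would pause non-critical agent work and switch to degraded mode while `open` is true, and call `record_success()` when a probe succeeds.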

Response Validation¶

Every API response is validated against the expected output schema before use. Malformed JSON, missing fields, or schema mismatches → retry (counted against the retry budget).
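As a rough sketch of that validation step, the function below checks a raw response against a flat field-to-type mapping using only the standard library. The names `validate_response`, `InvalidResponse`, and `PARSE_SCHEMA` are hypothetical, and a real deployment might use a full schema library instead; the spec does not say which.

```python
import json
from typing import Any, Dict, Mapping


class InvalidResponse(Exception):
    """Raised for any response that must be retried (hypothetical name)."""


# Illustrative schema for a task-parse response: field name -> expected type.
PARSE_SCHEMA: Mapping[str, type] = {"title": str, "priority": int}


def validate_response(raw: str, schema: Mapping[str, type]) -> Dict[str, Any]:
    """Parse and validate a response body, or raise InvalidResponse."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise InvalidResponse(f"malformed JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise InvalidResponse("expected a JSON object")
    for field, expected_type in schema.items():
        if field not in data:
            raise InvalidResponse(f"missing field: {field}")
        if not isinstance(data[field], expected_type):
            raise InvalidResponse(f"wrong type for field: {field}")
    return data
```

The resilience wrapper would catch `InvalidResponse` the same way it catches a transport error, so a bad payload consumes one retry.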

Health Monitoring¶

Layer 1: Docker Healthchecks¶

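Each service gets a healthcheck directive. Orchestrator exposes HTTP /health endpoint. Docker polls every 30s. Three consecutive failures mark the container unhealthy. Note that restart: unless-stopped by itself only restarts a container whose main process exits, so a restart on unhealthy status requires either the process to exit or a separate supervisor that acts on the health status.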

/health checks: SQLite reachable, Discord bot connected, scheduler loop running, last API health-check less than 10 min ago. Returns 200 when all checks pass, or 503 with JSON listing the failures.
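The aggregation behind /health can be sketched independently of any web framework. The function name, the check names, and the JSON shape below are assumptions; only the 200/503 behavior and the list of failures come from the text.

```python
import json
from typing import Callable, Dict, Tuple


def health_response(checks: Dict[str, Callable[[], bool]]) -> Tuple[int, str]:
    """Run every check and return (HTTP status, JSON body).

    A check that raises is treated as failed, so one broken probe cannot
    take down the endpoint itself.
    """
    failures = []
    for name, check in checks.items():
        try:
            ok = check()
        except Exception:
            ok = False
        if not ok:
            failures.append(name)
    status = 200 if not failures else 503
    body = json.dumps({"healthy": not failures, "failures": failures})
    return status, body
```

A caller would pass one callable per item, for example `{"sqlite": ..., "discord": ..., "scheduler": ..., "api_health": ...}`, and hand the tuple to whatever HTTP layer the orchestrator uses.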

Layer 2: External Watchdog¶

Separate lightweight process outside Docker. Every 5 min checks docker inspect --format='{{.State.Health.Status}}' donna-orchestrator. If unhealthy/stopped → alert via Twilio SMS or Discord webhook (independent of Donna bot).
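One polling pass of such a watchdog might look like the sketch below. The function names and the injectable `alert` callable are assumptions; the `docker inspect` invocation and the container name come from the text. In this sketch any status other than "healthy" (including "starting" or an inspect failure) raises an alert, which is a deliberate simplification.

```python
import subprocess
from typing import Callable, Optional

CONTAINER = "donna-orchestrator"


def container_health(container: str = CONTAINER) -> Optional[str]:
    """Return Docker's health status string, or None if inspect fails."""
    try:
        result = subprocess.run(
            ["docker", "inspect", "--format={{.State.Health.Status}}", container],
            capture_output=True, text=True, timeout=30,
        )
    except (OSError, subprocess.TimeoutExpired):
        return None  # docker missing, daemon down, or command hung
    if result.returncode != 0:
        return None  # e.g. container does not exist
    return result.stdout.strip()


def check_once(get_status: Callable[[], Optional[str]],
               alert: Callable[[str], None]) -> bool:
    """Run one watchdog pass; call alert() unless the container is healthy."""
    status = get_status()
    if status == "healthy":
        return True
    alert(f"{CONTAINER} health: {status or 'unreachable'}")
    return False
```

A cron entry or systemd timer firing every 5 minutes could call `check_once(container_health, send_sms)`, where `send_sms` is whatever Twilio or Discord webhook sender the watchdog is given.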

Layer 3: Daily Self-Diagnostic¶

Part of morning digest generation. Before generating: DB integrity (PRAGMA integrity_check), NVMe space, last calendar sync, last Supabase sync, pending migrations, budget status. Issues prepended to morning digest.
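Two of these checks, database integrity and free disk space, can be sketched with the standard library alone. The function names, the free-space threshold parameter, and the "Issues:" prefix format are assumptions; `PRAGMA integrity_check` is the check named above.

```python
import shutil
import sqlite3
from typing import List


def db_integrity_issues(db_path: str) -> List[str]:
    """Run PRAGMA integrity_check; return [] when SQLite reports 'ok'."""
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute("PRAGMA integrity_check").fetchall()
    finally:
        conn.close()
    messages = [row[0] for row in rows]
    return [] if messages == ["ok"] else messages


def disk_space_issues(path: str, min_free_bytes: int) -> List[str]:
    """Report an issue when free space on the volume is below the threshold."""
    free = shutil.disk_usage(path).free
    if free < min_free_bytes:
        return [f"low disk space on {path}: {free} bytes free"]
    return []


def prepend_issues(digest: str, issues: List[str]) -> str:
    """Put diagnostic issues ahead of the digest body, as the spec requires."""
    if not issues:
        return digest
    header = "\n".join(f"Issue: {issue}" for issue in issues)
    return f"{header}\n\n{digest}"
```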

Crash Handler & Error Logging¶

Unhandled exceptions must never fail silently. The orchestrator registers crash handlers at startup:

sys.excepthook: Catches unhandled synchronous exceptions. Logs full traceback to structlog → donna_logs.db → Loki.
asyncio exception handler: Catches unhandled async exceptions via loop.set_exception_handler(). Same logging pipeline.
On crash: Log the full traceback at CRITICAL level, trigger Layer 2 alerting (external watchdog detects container restart), and ensure the Docker healthcheck transitions to unhealthy.
Periodic health checks: The orchestrator's /health endpoint runs checks every 30 seconds (Docker healthcheck interval). On each check: SQLite reachable, Discord bot connected, scheduler loop running, last API health-check response < 10 min. Any failure returns HTTP 503 with JSON listing failures.
All errors logged to donna_logs.db with event types from the hierarchical scheme (e.g., system.crash, api.call.failed, agent.timeout). Grafana alert rules trigger on CRITICAL-level log entries.

Backup Strategy¶

Method¶

SQLite .backup API (connection.backup()). Never a raw file copy: copying a WAL-mode SQLite database while writes are in progress can produce a corrupt backup.
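In Python's `sqlite3` module the online backup is a single call on the source connection. The wrapper name `backup_database` and the path arguments are assumptions for illustration.

```python
import sqlite3


def backup_database(live_path: str, backup_path: str) -> None:
    """Copy a live SQLite database to backup_path using the online backup API.

    Unlike a file copy, this yields a consistent snapshot even while other
    connections are writing to the source.
    """
    src = sqlite3.connect(live_path)
    dst = sqlite3.connect(backup_path)
    try:
        src.backup(dst)
    finally:
        dst.close()
        src.close()
```

The nightly job would call this once per database, for example for donna_tasks.db and donna_logs.db, writing to a dated file name.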

Schedule & Retention¶

Daily at 3 AM (blackout hours): full backup of donna_tasks.db and donna_logs.db
7 daily, 4 weekly (Sunday), 3 monthly (1st) retained
Worst case: ~14 backups × 500MB = 7GB (trivial on 1TB NVMe)
Off-server: weekly/monthly pushed to cloud (GCS free 5GB or Backblaze B2 ~$0.04/month)
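The retention rule can be expressed as a pure function over backup dates, which makes the pruning step easy to test. The function name and signature are assumptions; the 7/4/3 counts and the Sunday and 1st-of-month anchors come from the list above.

```python
from datetime import date
from typing import Iterable, Set


def backups_to_keep(backup_dates: Iterable[date], daily: int = 7,
                    weekly: int = 4, monthly: int = 3) -> Set[date]:
    """Return the backup dates that the retention policy keeps."""
    ordered = sorted(set(backup_dates), reverse=True)  # newest first
    dailies = ordered[:daily]
    # date.weekday() returns 6 for Sunday.
    weeklies = [d for d in ordered if d.weekday() == 6][:weekly]
    monthlies = [d for d in ordered if d.day == 1][:monthly]
    return set(dailies) | set(weeklies) | set(monthlies)
```

Everything not in the returned set is eligible for deletion. Because the three tiers can overlap, the kept count is at most 14, which matches the worst-case estimate above.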

Recovery¶

RPO: 24 hours (last daily backup). Supabase replica reduces effective RPO to sync interval.
Procedure: Stop orchestrator → copy backup to live path (removing any stale -wal and -shm files left beside the old database) → restart → orchestrator triggers Supabase re-sync. Documented in RECOVERY.md.

Security & Privacy¶

Principle	Implementation
Least privilege	Each agent has only the tools defined in task type registry
No credentials in agent context	Agents request tool calls via orchestrator. Never see raw API keys.
Sandboxed filesystem	Agents only access `/donna/workspace/`
Git safety	Feature branches only. Main/production have push protection.
Email safety	Draft-only default. Send scope behind feature flag + OAuth re-auth.
No data exfiltration	MCP server whitelists allowed outbound destinations
Tool validation	Orchestrator validates all model tool call requests before execution
Blackout enforcement	12am–6am hard block on outbound at notification service level
Log sanitization	No credentials in logs. API bodies at DEBUG only with redaction.
NVMe encryption	LUKS at rest. Decryption via TPM or entered at boot.
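As one example of how a row in this table could be enforced, the sketch below checks the "Sandboxed filesystem" rule before a file tool call runs. The function name is hypothetical and the check is an assumption about where validation happens; only the `/donna/workspace/` root comes from the table.

```python
from pathlib import Path

WORKSPACE = Path("/donna/workspace")


def is_inside_workspace(candidate: str, workspace: Path = WORKSPACE) -> bool:
    """Return True only if candidate resolves to a path under the workspace.

    Resolving first collapses '..' segments and follows symlinks, so
    traversal tricks are judged by where the path actually lands.
    """
    root = workspace.resolve()
    target = (root / candidate).resolve()  # absolute candidates replace root
    try:
        target.relative_to(root)
    except ValueError:
        return False
    return True
```

The orchestrator would reject the tool call when this returns False, which also fits the "Tool validation" row.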

Acceptable Failures¶

Priority misclassification (user corrects → feeds learning)
Duplicate reminders (annoying, no data loss)
Agent produces low-quality code (user reviews before merge)
Suboptimal scheduling (user reschedules → feeds learning)
Local LLM misroutes to Claude API (costs more, completes correctly)

Unacceptable Failures¶

Missing a deadline reminder (must escalate)
Sending emails to unintended recipients (architecturally blocked)
Deleting files without backup (append/modify only)
Overwriting code without version control (always branched/stashed)
Exceeding budget without notification (synchronous cost monitoring)
Contact during blackout (hard block at notification service)
Agent running indefinitely (timeout enforced)
Learned preference causing repeated errors (auto-disabled)
Silent service failure (detected within 10 min)
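Two of the guards in this list lend themselves to short sketches: the blackout hard block and the agent timeout. Both functions below are hypothetical; the 12am to 6am window comes from the Security & Privacy table, treating 6:00 exactly as outside the window is an assumption, and the timeout value is left to the caller because the spec does not state one here.

```python
import asyncio
from datetime import datetime, time as dtime
from typing import Any, Awaitable, Optional, Tuple

BLACKOUT_START = dtime(0, 0)
BLACKOUT_END = dtime(6, 0)  # assumed exclusive: 06:00 itself is allowed


def in_blackout(now: datetime) -> bool:
    """True when outbound notifications must be blocked."""
    return BLACKOUT_START <= now.time() < BLACKOUT_END


async def run_with_timeout(work: Awaitable[Any],
                           timeout_s: float) -> Tuple[str, Optional[Any]]:
    """Run agent work with a hard deadline.

    Returns ("done", result) on success or ("failed", None) on timeout;
    asyncio.wait_for cancels the work when the deadline passes.
    """
    try:
        result = await asyncio.wait_for(work, timeout=timeout_s)
    except asyncio.TimeoutError:
        return "failed", None
    return "done", result
```

The notification service would consult `in_blackout()` immediately before every send, so no caller can bypass it, and the "failed" status maps onto `agent_status = failed` from the retry table.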