Observability & Logging¶

Split from Donna Project Spec v3.0 — Sections 14, 15

Principle¶

Observability is a Phase 1 deliverable. Every Donna service emits structured JSON logs. Debugging any issue should require seconds, not SSH and grep.

Logging Framework¶

All Python services use structlog with JSON output and contextvars for async context propagation. Every incoming request binds correlation_id, user_id, channel, and task_id as context variables that appear in all downstream log entries.

Log Levels¶

Level	When	Examples
DEBUG	Detailed diagnostics. Off in prod unless troubleshooting.	Full prompts, API response bodies, dedup scores, scheduler slot evaluation
INFO	Normal operations. System working correctly.	Task created, state changed, reminder sent, digest generated, calendar synced
WARNING	Unexpected but handled.	API retry, confidence below threshold, degraded mode activated
ERROR	Operation failed but system continues.	API failed after retries, schema validation rejected, agent timeout
CRITICAL	System-level failure, immediate attention.	Circuit breaker activated, DB corruption, orchestrator crash, NVMe full

Logging Database¶

Dedicated donna_logs.db on NVMe. Separate from task DB to avoid contention.

Log Table Schema¶

Field	Type	Purpose
id	INTEGER PK	Auto-incrementing
timestamp	TEXT ISO 8601	When (UTC)
level	TEXT	DEBUG–CRITICAL
service	TEXT	orchestrator, mcp_server, discord_bot, scheduler, notification, agent_worker, sync
component	TEXT	input_parser, calendar_sync, state_machine, preference_engine, etc.
event_type	TEXT	Machine-readable: `task.created`, `api.call.failed`, `agent.timeout`
message	TEXT	Human-readable
correlation_id	TEXT	Traces single request across all services
task_id	TEXT?	Associated task UUID
user_id	TEXT?	User who triggered
agent_id	TEXT?	Agent type if from agent worker
channel	TEXT?	discord, sms, email, app, system
duration_ms	INTEGER?	For timed operations
cost_usd	REAL?	API cost if model call
error_type	TEXT?	Exception class name
error_trace	TEXT?	Full stack trace
extra	TEXT (JSON)	Additional structured context

Indexes on: timestamp, level, service, event_type, correlation_id, task_id, error_type. WAL mode.

Retention Policy¶

Level	Retention
DEBUG	7 days
INFO	30 days
WARNING	90 days
ERROR / CRITICAL	1 year
Invocation logs	Permanent (cost analysis, evaluation, preferences)

Nightly cron prunes expired logs. Weekly VACUUM reclaims disk space.

Event Types (Hierarchical)¶

task.*: created, state_changed, dedup_detected, overdue, escalation_triggered
api.*: call.started, call.completed, call.failed, call.retried, circuit_breaker.opened/closed, degraded_mode.activated
agent.*: dispatched, progress, completed, failed, timeout, interrogation.sent/response_received
scheduler.*: weekly_plan, daily_recalc, slot_assigned, conflict_detected, calendar_sync.completed/user_modification
notification.*: sent, failed, escalated, acknowledged, blackout_blocked
preference.*: correction_logged, rule_extracted, rule_applied, rule_disabled
system.*: startup, shutdown, health_check, backup.completed/failed, migration.applied
cost.*: daily_threshold, monthly_warning, agent_paused, budget_increase
sync.*: supabase.push, supabase.failed, keepalive.sent

Log Pipeline (Phase 1 — Dual Write)¶

Each service writes structured JSON to stdout. Docker captures via json-file log driver.
Promtail (in donna-monitoring.yml) tails Docker logs → ships to Loki.
Grafana queries Loki for real-time dashboard.
Simultaneously, lightweight log collector in orchestrator writes to SQLite log DB for programmatic access and retention management.

Dashboard Panels¶

System Health¶

Service status (green/yellow/red per container)
Last successful ops timestamps
NVMe disk usage breakdown
Memory/CPU per container
Circuit breaker state

Task Pipeline¶

Tasks created today/week (by channel, domain)
State distribution (backlog/scheduled/in_progress/blocked/done/cancelled)
Avg time-to-schedule
Reschedule frequency (3+ highlighted)
Dedup hit rate
Completion velocity trend

LLM & Cost¶

API calls per hour/day (by task type, model)
Token usage breakdown
Daily/weekly/monthly spend, burn rate, projected monthly, budget remaining
Latency p50/p95/p99 by task type
Error rate, retries, circuit breaker activations
Shadow mode comparison (when shadow key is set in routing config)

Agent Activity¶

Active agents: task, elapsed vs timeout
Completed today/week with cost and duration
Failed today with error summaries
Cost per agent type

Notifications¶

Messages sent (by channel, type)
Delivery failures
Escalation events
User response times (feeds preference learning)

Error Exploration¶

Filterable table by service, component, event type, time, severity
Error frequency timeline
Correlation trace: full request lifecycle across services
Stack trace viewer

Preference Learning¶

Corrections per week trend
Rules extracted, rules auto-disabled
Rule survival rate (% active after 30 days)

Alerting Rules¶

Condition	Action
Service down > 5 min	Discord #donna-debug webhook + SMS
> 10 errors in 5 min	Discord #donna-debug
Circuit breaker opened	Discord #donna-debug + SMS
NVMe disk > 80%	Discord #donna-debug
Supabase sync failure > 1 hour	Discord #donna-debug
No orchestrator heartbeat 10 min	External watchdog handles (see `docs/resilience.md`)