spec_v3.md — Canonical Design Document
Source of truth: this page embeds the authoritative spec_v3.md from the repo root. Edits belong in /spec_v3.md, not here.
DONNA
AI Personal Assistant
Project Specification Document
Version 3.1 --- April 2026 (synced with implementation)
Classification: Personal / Confidential
Revision Notes (v3.1, April 2026)
v3.0 was the original design document (March 2026). v3.1 reconciles the spec with the production codebase after Phases 1--5 were built. The design intent is unchanged; substantive updates in v3.1 are:
- §3.2--3.3 The hybrid MCP strategy is deferred. Tier 1 (direct Python integration) is the sole pattern in production; FastMCP server and GitHub / Notes / Filesystem / Web Search (SearXNG) integrations are not implemented.
- §5.1 Task schema adds nudge_count, quality_score, capability_name, inputs_json.
- §5.4 Task Type Registry expanded with eight production task types (dedup_check, task_decompose, extract_preferences, generate_nudge, generate_reminder, challenge_task, generate_weekly_digest, classify_chat_intent).
- §5.5.2 Priority escalation partially implemented (deadline + workload pressure only; dependency-chain escalation and user-lock flag not yet built).
- §6.1.2 / §6.2 Conflict-resolution matrix simplified to basic overlap detection; "Extended Work" and "Emergency Work" time windows are not configured.
- §7.1.1 Agent hierarchy adds Challenger Agent and Claude Novelty Judge (both enabled). Coding Agent and Communication Agent are defined in config but disabled pending Phase 4 Stage 3 tool progression.
- §11.1 Morning and end-of-day digests deliver over Discord (not email). The spec'd "Conflict Alert" notification type is not implemented.
- §12.1 Integration matrix reflects current reality: only Gmail, Google Calendar, Discord, Twilio, Supabase, and the internal SQLite DB are integrated. GitHub, Filesystem, Notes, and SearXNG are deferred.
- §14.3.1 The "logging database schema" is now split in practice: structured per-invocation data lives in invocation_log (SQLite); general service logs flow through Docker → Promtail → Loki. No separate donna_logs.db exists.
- §16.3.2 Off-server backup push (GCS / Backblaze) is not implemented; backups remain on local NVMe with rotation.
- §20 Implementation-phase rewrite: Phases 1--5 are complete; Phase 6 work (GitHub/MCP rollout, Coding/Communication agents, Flutter production app) is next.
- §§23--29 (new) The following production subsystems were built beyond v3.0 and are now documented: Skills & Capabilities System (§23), Chat / Conversation Engine (§24), Automations Subsystem (§25), LLM Gateway & Queue (§26), Admin API & Dashboard (§27), Authentication & Access Control (§28), Setup Wizard (§29).
- §30 (new) Memory & Vault Subsystem documents the Obsidian-compatible vault plumbing (slice 12), the sqlite-vec-backed semantic memory layer (slice 13), the episodic chat / task / correction sources (slice 14), and the template-driven vault writes (slice 15): MemoryStore, EmbeddingProvider, MarkdownHeadingChunker, VaultSource, ChatSource/TaskSource/CorrectionSource, VaultTemplateRenderer, MemoryInformedWriter, the MeetingNoteSkill + MeetingEndPoller reference trigger, and the memory_search agent tool. The authoritative design lives at docs/reference-specs/memory-vault-spec.md.
Everything else in v3.0 remains canonical design intent.
1. Executive Summary
Donna is an AI-powered personal assistant system designed to solve a specific, well-defined problem: the user consistently forgets to capture tasks, rarely consults existing task lists, and does not schedule dedicated time to complete work. The system is named after Donna Paulsen from the television show Suits and adopts her communication style --- sharp, confident, efficient, occasionally witty, and always one step ahead.
The system goes beyond passive task storage. It actively pursues the user, demands updates, reschedules dynamically, prepares for upcoming work, and progressively delegates tasks to autonomous sub-agents. Effectiveness is the single most important design criterion. Every architectural decision is evaluated against the question: does this help Nick get more done with less friction?
1.1 Core Objectives
- Active Task Capture: Make it effortless to record tasks from any device or context, and proactively remind the user to capture tasks they may have forgotten.
- Intelligent Scheduling: Dynamically schedule, reschedule, and prioritize tasks based on deadlines, inferred complexity, calendar constraints, and real-time changes.
- Prep Work Automation: Perform research, compile information, and prepare deliverables before the user begins a task.
- Autonomous Sub-Agents: Route eligible tasks to specialized AI agents that can assess requirements, request clarification, and complete work independently.
- Persistent Follow-Up: Never let tasks silently expire. Escalate reminders across channels until the user acknowledges.
- Adaptive Learning: Learn from user corrections and behavioral patterns to improve task classification, scheduling, and routing over time without model fine-tuning.
1.2 Key Constraints
- Monthly API Budget: $100/month for Claude API (separate from Claude Pro subscription). Local LLM offloading deferred until GPU hardware secured.
- Single User (Multi-User Designed): Built for the owner. Data model includes user_id from day one to support future multi-user deployment (Phase 3--4).
- Platform: Windows workstation (i7-12th gen, 32GB, RTX 2080 Ti). Always-on Linux server (i7-6700K, 32GB 3200MHz, GTX 1080 dedicated to Immich/media). Android phone (Pixel 8 Pro). MacBook Pro for mobile.
- GPU Strategy: RTX 3090 (to be acquired) dedicated to Donna and local LLM. GTX 1080 remains allocated to Immich and media services. No GPU sharing between workloads.
- Storage: Dedicated 1TB NVMe for Donna --- database, workspace, agent outputs, backups, and logs. All Donna data lives on this volume.
- Privacy: Sensitive data stays off accessible filesystem. Clean cloud storage or sandboxed local folders only.
1.3 Key Design Principles (v3)
The following principles were established during design review and govern all architectural decisions:
- Config over code: Model selection, routing rules, prompt templates, task type definitions, and user preferences are stored as configuration data, not hardcoded in application logic. Adding or changing capabilities should not require code changes to the orchestrator.
- Safety first, dial back later: All agents start with minimal autonomy and strict constraints. Trust is earned through logged, reviewed performance. Constraints are relaxed explicitly, never assumed.
- Structured logging on every model call: Every LLM invocation (local or cloud) is logged with task type, model used, latency, tokens, cost, and output. This is the foundation for evaluation, cost tracking, and routing optimization.
- Comprehensive observability: All services emit structured JSON logs to a dedicated logging database. A self-hosted dashboard provides real-time search, filtering, and alerting. Debugging any issue should never require SSH and grep.
- Dev tool that evolves into runtime feature: The model comparison and evaluation system starts as a development tool for validating local LLM quality, but is designed with the data model and interfaces to evolve into runtime routing optimization.
- Internal API over protocol overhead: Service integrations use a thin internal Python API layer for orchestrator-to-service calls. MCP is reserved for LLM-facing tool discovery where agents need to reason about available tools dynamically.
- Clean seams, not frameworks: The system is designed for configurability where it matters (model providers, tool access, task types) without building abstract plugin architectures. Two implementations behind a clean interface, not a generic extension system.
2. Persona & Communication Style
The assistant's persona is modeled after Donna Paulsen from Suits. This is not cosmetic --- it directly affects user engagement and compliance with the system. The persona is implemented as a system prompt that governs all outbound communication.
2.1 Personality Traits
- Confident and direct. Donna does not hedge. "You have three tasks overdue. I rescheduled two. The third needs your input."
- Proactive. She anticipates needs. "I noticed your oil change has been rescheduled twice. I'm putting it on Saturday morning before it becomes a problem."
- Witty but professional. Light humor is acceptable. Sarcasm when the user is behind on tasks is on-brand. Never sycophantic.
- Efficient. Messages are concise. No filler. Bullet points and clear action items.
- Loyal and protective of the user's time. She pushes back on overcommitment and flags when the schedule is unrealistic.
2.2 Communication Examples
Morning digest: "Good morning. You have 6 tasks today, 2 carry-overs from yesterday. Your 10am meeting prep is done --- check your email. Also, that invoice you've been avoiding? It's due Friday. I put it at 2pm."
Overdue nudge (SMS): "It's 3pm and you haven't touched the API refactor. Did you finish it or should I find time tomorrow?"
Budget warning: "Heads up --- agents burned through $22 today, mostly on the codebase analysis. I've paused all autonomous work. Want me to continue or are we done for the day?"
3. System Architecture
3.1 Architecture Overview
The system follows a hub-and-spoke architecture with a central orchestrator managing all task routing, scheduling, and agent coordination. The always-on Linux server acts as the system backbone, running the orchestrator, integration layer, and background agent workers. All services are deployed as Docker containers using the established homelab compose pattern.
3.2 Tool Integration Architecture
The original specification prescribed MCP (Model Context Protocol) as the sole tool integration layer. Industry developments in early 2026 have revealed significant trade-offs with an MCP-only approach. Donna adopts a hybrid architecture that uses the right integration pattern for each use case.
v3.1 Implementation Status: Only Tier 1 (direct Python API) is in production. Tier 2 (FastMCP server) and all MCP-wrapped integrations (GitHub, Filesystem, Notes, Web Search) are deferred to Phase 6. Production LLM tool access is instead provided by a small, internal tool-registry layer inside the Skills subsystem (see §23.3). The MCP design in §3.2.1--3.2.5 is retained as the target architecture for when Coding/Communication agents come online.
3.2.1 The MCP Context Cost Problem
MCP servers dump their entire tool schema into the LLM's context window on connection. A typical server with 20--90+ tools can consume 30,000--150,000+ tokens before any query is processed. Industry experience in 2025--2026 showed this overhead consuming 40--50% of available context windows. For Donna, where every token costs money against a $100/month budget, this overhead is unacceptable for internal integrations where the orchestrator (not the LLM) is making the call.
Mitigations exist: Claude's Tool Search feature (January 2026) reduces overhead by ~85% through deferred loading. FastMCP 3.x supports CodeMode, which exposes just search() and execute() meta-tools (~1,000 tokens instead of the full catalog). However, even with mitigations, MCP adds serialization overhead and protocol complexity that is unnecessary when the orchestrator already knows what tool to call with what parameters.
3.2.2 Hybrid Strategy
Donna separates tool usage into two tiers based on who is making the call:
Tier 1: Internal Python API (Primary)
All orchestrator-to-service integration uses thin Python modules. The orchestrator calls functions directly --- no protocol overhead, no schema in context windows. The LLM outputs structured JSON; the orchestrator maps fields to API calls. Zero tokens consumed for tool definitions.
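A minimal sketch of the Tier 1 pattern; the helper below stands in for the thin Calendar wrapper in integrations/calendar.py, and its signature is an assumption, not the actual module API:

```python
from typing import Any

async def create_event(title: str, start: str, duration_minutes: int) -> str:
    """Stand-in for the thin Calendar wrapper (hypothetical signature)."""
    ...

async def schedule_from_llm_output(parsed: dict[str, Any]) -> str:
    """Tier 1 pattern: the LLM returned structured JSON; the orchestrator maps
    fields straight to a direct API call. No tool schema enters the context."""
    return await create_event(
        title=parsed["title"],
        start=parsed["scheduled_start"],
        duration_minutes=parsed["estimated_duration"],
    )
```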
Tier 2: MCP Endpoint (LLM-Facing + External Clients)
MCP via FastMCP 3.x is used when agents need dynamic tool discovery during reasoning (Research Agent deciding which tools to use, Coding Agent exploring a repo). Also maintained as a Streamable HTTP endpoint for the Flutter app, Claude Desktop, and future third-party clients.
Decision framework for each integration:
| Integration | Pattern | Rationale |
|---|---|---|
| Google Calendar | Direct API (Python client) | Orchestrator calls with known params. No discovery needed. |
| SQLite Task DB | Direct API (aiosqlite) | Internal data store. MCP wrapper would add pure overhead. |
| Discord Bot | Direct API (discord.py) | Bidirectional messaging. Bot framework handles natively. |
| Gmail API | Direct API (Python client) | Orchestrator reads/drafts with known scopes. |
| Twilio SMS/Voice | Direct API (Python client) | Outbound notifications with fixed parameters. |
| Supabase Sync | Direct API (supabase-py) | Background sync process with fixed schema. |
| GitHub | MCP (FastMCP) | Coding Agent explores repos and issues dynamically. |
| Web Search | MCP (FastMCP) | Research Agent discovers and invokes search dynamically. |
| Filesystem (sandboxed) | MCP (FastMCP) | Agents need to discover and navigate files dynamically. |
| Notes (Local Markdown) | MCP (FastMCP) | Agents need to discover and read notes dynamically. |
3.2.3 FastMCP Server (Python)
The MCP server is implemented in Python using FastMCP 3.x, keeping the entire stack in a single language. It exposes tools that agents need during LLM-driven reasoning. The server uses FastMCP's CodeMode transform to minimize context window consumption.
Key design principles: tool granularity (each action is a separate tool for fine-grained access control per agent and per task type), centralized authentication (all OAuth tokens, API keys stored in MCP server config, never passed to agents), audit logging (every tool invocation logged with timestamp, calling agent, parameters, and result), rate limiting (per-tool limits to prevent runaway agents), and tool registry as config (adding a new tool = implementation + config entry, orchestrator discovers tools at startup).
3.2.4 Integration Modules
integrations/
├── calendar.py ← Google Calendar (read-write personal, read all) [IMPLEMENTED]
├── gmail.py ← Gmail (read + draft; send behind feature flag) [IMPLEMENTED]
├── discord_bot.py ← Send/read in Donna channels [IMPLEMENTED]
├── discord_commands.py ← Slash commands, pending-draft feed [IMPLEMENTED]
├── discord_views.py ← Interactive components (buttons/selects) [IMPLEMENTED]
├── twilio_sms.py ← SMS (outbound + inbound webhook) [IMPLEMENTED]
├── twilio_voice.py ← Voice (TTS escalation, Tier 4, disabled) [IMPLEMENTED but off by default]
├── supabase_sync.py ← Async write-through sync + keep-alive [IMPLEMENTED]
├── email_parser.py ← Parse forwarded email → task / correction [IMPLEMENTED]
├── github.py ← GitHub (MCP-wrapped) [DEFERRED Phase 6]
├── filesystem.py ← Sandboxed /donna/workspace/ [DEFERRED Phase 6]
├── notes.py ← Local markdown notes [DEFERRED Phase 6]
├── search.py ← Web search (SearXNG/API) [DEFERRED Phase 6]
└── mcp_wrapper.py ← FastMCP Streamable HTTP [DEFERRED Phase 6]
Each implemented module has centralized auth (env vars + config/auth.yaml), structured audit logging, rate limiting where applicable, and per-agent access control via task-type tool allowlists.
3.2.5 Adopt Before Building
Before implementing custom MCP tools, evaluate existing open-source MCP servers from the community (e.g., google-calendar-mcp, GitHub MCP server). If a community server covers 80%+ of needed functionality, adopt it and extend rather than building from scratch. The FastMCP framework's composability model supports mounting external servers alongside custom tools.
3.3 Component Map
| Component | Location | Purpose |
|---|---|---|
| Orchestrator Service | Linux Server (Docker) | Central brain. Manages task queue, scheduling engine, agent dispatch, and cost monitoring. Runs 24/7. |
| Claude API (Anthropic) | Cloud | Primary LLM for all reasoning: task parsing, classification, routing, code generation, prep work, scheduling decisions. Sole provider until local LLM hardware is available. |
| Local LLM (Ollama) | Linux Server (Docker), RTX 3090 | DEPLOYED. Ollama container running qwen2.5:32b-instruct-q6_K. Handles nudge/reminder generation, chat-intent classification, and Stage 1 read-only tool calls. Claude remains the primary parser/planner. |
| Integration Layer | Linux Server (Docker) | Internal Python API wrapping external services actually in use (Gmail, Calendar, Discord, Twilio, Supabase). Centralized auth, audit logging. |
| FastMCP Server | Linux Server (Docker) | DEFERRED (Phase 6). The design is retained; in production, LLM-facing tool access is provided by the internal Skills tool registry (§23.3). |
| Task Database | SQLite on NVMe (donna_tasks.db) | Primary task storage with full metadata. WAL mode. Also hosts invocation_log, correction_log, learned_preference, skill/capability tables, automation tables, chat_session tables, and auth/device-token tables (see §16.1 and Alembic migration list). |
| Log Storage | Loki (Docker) + invocation_log | The standalone donna_logs.db in v3.0 was not built. Structured service logs ship via stdout → Docker json-file → Promtail → Loki. Per-LLM-call records live in the invocation_log SQLite table. |
| Sync Replica | Supabase (Postgres) | Cloud replica for cross-device access. Free tier with keep-alive; upgrade to Pro at Phase 4 or multi-user. |
| Observability Dashboard | Grafana + Loki (Docker) | Self-hosted. Real-time log search, filtering, metrics, and alerting. Phase 1 deliverable. |
| Notification Service | Linux Server (Docker) | Outbound communication: email (Gmail API/SMTP), SMS (Twilio), phone (Twilio TTS), push (FCM), Discord bot. |
| Agent Worker Pool | Linux Server (orchestrator process) | Sub-agents run inside the orchestrator process (asyncio tasks), not isolated containers. Per-agent tool allowlists defined in config/agents.yaml are enforced by the Skills tool registry. Container isolation is Phase 6. |
| Skills & Capabilities System | Linux Server (orchestrator) | Execution engine for user-defined capabilities and their concrete skill implementations (DSL-based). Shadow sampling, evolution, divergence tracking. See §23. |
| LLM Gateway | Linux Server (orchestrator) | Cost-aware priority queue for all outbound LLM calls (Claude + Ollama). Preemption, rate limiting, budget alerts. See §26. |
| Admin API | Linux Server (Docker) | FastAPI service (port 8200) exposing dashboards, logs, skill pipeline, agent state, task/preference CRUD. See §27. |
| Web/Mobile App | Firebase Hosting + Flutter | Backend API is live; Flutter client lives in separate donna-app/ repo. Full production UI is Phase 6. |
3.4 Data Flow
All task inputs (SMS, Discord, Slack, app, email forwarding) are normalized into a standard task schema by the Input Parser. During Phase 1--2 (pre-local LLM), parsing runs on Claude API. Once the RTX 3090 is available and the local model is validated, high-frequency parsing shifts to the local LLM with Claude API as fallback. The Orchestrator evaluates each task against the scheduling engine and decides routing. Agent outputs are stored, reviewed, and surfaced through the notification service.
3.5 Infrastructure & Deployment
Donna deploys on the existing homelab Docker Compose infrastructure. Each Donna service gets its own compose file following the established multi-file pattern, attaching to the shared homelab network. This means Donna services can communicate with existing services (if needed) and new Donna components can be added without modifying existing stack files.
3.5.1 Docker Compose Structure
docker/
├── .env.example ← copy to .env (gitignored)
├── core.yml ← shared homelab network
├── immich.yml ← Immich stack (GTX 1080)
├── donna-core.yml ← Orchestrator, integration layer, notification service
├── donna-monitoring.yml ← Grafana, Loki, Promtail (dev dashboard)
├── donna-ollama.yml ← Ollama + local LLM (RTX 3090, added post-GPU)
└── donna-app.yml ← FastAPI backend (Flutter app connects here)
3.5.2 GPU Isolation
GPU assignment is managed through environment variables in docker/.env, consistent with the existing homelab pattern. Each GPU-using service references its own variable. No compose file changes are needed when hardware changes.
# docker/.env
IMMICH_ML_GPU_ID=0 # GTX 1080 --- dedicated to Immich/media
DONNA_OLLAMA_GPU_ID=1 # RTX 3090 --- dedicated to Donna LLM
This isolation ensures no VRAM contention between Immich's ML pipeline and Donna's local LLM inference. The 3090's 24GB VRAM provides substantial headroom for running larger quantized models or multiple models concurrently if needed.
3.5.3 NVMe Storage Layout
The dedicated 1TB NVMe is mounted and organized as Donna's complete data volume:
/donna/
├── db/
│ ├── donna_tasks.db ← Primary SQLite database (tasks, invocation_log, skills, automations, chat sessions, auth)
│ ├── donna_logs.db ← (design-only; not created in v3.1 — see §14.3)
│ └── donna_eval.db ← (design-only; not created in v3.1 — see §16.1)
├── workspace/ ← Agent sandboxed working directory
├── backups/
│ ├── daily/ ← 7-day retention
│ ├── weekly/ ← 4-week retention
│ ├── monthly/ ← 3-month retention
│ └── offsite/ ← Staging for cloud backup sync
├── logs/
│ └── archive/ ← Compressed historical log exports
├── config/
│ ├── donna_models.yaml ← Model routing configuration
│ ├── task_types.yaml ← Task type definitions
│ ├── task_states.yaml ← State machine transitions
│ └── preferences.yaml ← Learned preference rules
├── prompts/ ← Externalized prompt templates
├── fixtures/ ← Evaluation test fixtures (version-controlled)
└── models/ ← Ollama model cache (Phase 3+)
3.6 API Resilience Layer
Every Claude API call goes through a resilience wrapper handling retries, degraded mode fallback, and circuit breaking. Phase 1 is entirely dependent on Claude API availability; the resilience layer ensures Donna degrades gracefully rather than failing silently.
3.6.1 Retry Policies
| Task Category | Max Retries | Backoff | On Failure |
|---|---|---|---|
| Critical (digest, deadline reminders) | 3 | Exponential, 2s start, 30s cap | Fall back to degraded mode (template-based) |
| Standard (parse, classify) | 2 | Exponential, 1s start, 15s cap | Queue for retry on next cycle; notify user of delay |
| Agent work (research, code gen) | 1 | 5s fixed | Mark agent_status = failed; notify user; do not retry (budget protection) |
3.6.2 Degraded Mode Definitions
- Morning digest: Generate a template-based digest using raw calendar data and task list from SQLite. No LLM reasoning, no Donna persona. Format: "Today's schedule: [list of events]. Tasks due: [list]. Note: AI digest unavailable, showing raw data." This ensures the user always gets their morning overview even if Claude is down.
- Reminders: Send with static template: "Reminder: [task title] starts at [time]." Functionality preserved, personality lost --- acceptable trade-off.
- Task parsing: Accept raw text as-is, create a task with title = raw input, all other fields set to defaults, and flag it for re-parsing when the API recovers. Never lose a task capture because the LLM is unavailable.
3.6.3 Circuit Breaker
If 5 consecutive API calls fail within a 10-minute window, the orchestrator enters circuit-breaker mode: pause all non-critical agent work, switch all critical paths to degraded mode, send the user a single SMS notification ("Donna's AI is temporarily unavailable. Reminders and captures are running in basic mode. I'll notify you when it's restored."). The circuit breaker tests recovery every 5 minutes with a lightweight health-check call. Resets on first successful response.
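A minimal sketch of this policy with the thresholds above; the class and method names are illustrative, not the production implementation:

```python
import time

class CircuitBreaker:
    """Opens after `threshold` consecutive failures inside `window_s` seconds,
    probes recovery every `probe_interval_s`, resets on first success (§3.6.3)."""

    def __init__(self, threshold: int = 5, window_s: int = 600, probe_interval_s: int = 300):
        self.threshold = threshold
        self.window_s = window_s
        self.probe_interval_s = probe_interval_s
        self.failures: list[float] = []      # timestamps of recent failures
        self.open_since: float | None = None
        self.last_probe = 0.0

    @property
    def is_open(self) -> bool:
        return self.open_since is not None

    def record_failure(self) -> None:
        now = time.monotonic()
        self.failures = [t for t in self.failures if now - t < self.window_s]
        self.failures.append(now)
        if len(self.failures) >= self.threshold and not self.is_open:
            self.open_since = now            # switch critical paths to degraded mode

    def record_success(self) -> None:
        self.failures.clear()
        self.open_since = None               # first success closes the breaker

    def should_probe(self) -> bool:
        """While open, allow one lightweight health-check call per interval."""
        now = time.monotonic()
        if self.is_open and now - self.last_probe >= self.probe_interval_s:
            self.last_probe = now
            return True
        return False
```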
3.6.4 Response Validation
Every API response is validated against the expected output schema before being used. Malformed JSON, missing required fields, or schema mismatches trigger a retry (counted against the retry budget). This catches partial responses from timeouts or rate-limited truncations.
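A sketch of the validation step using the jsonschema library; production schema loading and retry accounting are omitted:

```python
import json
from jsonschema import validate  # pip install jsonschema

def validate_response(raw: str, schema: dict) -> dict:
    """Parse and schema-check a model response before use (§3.6.4 sketch).
    Any failure raises (json.JSONDecodeError or jsonschema.ValidationError),
    which the caller counts against the retry budget."""
    payload = json.loads(raw)                  # rejects malformed/truncated JSON
    validate(instance=payload, schema=schema)  # rejects missing fields, type mismatches
    return payload
```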
3.7 Concurrency Model
3.7.1 Phase 1--2: Single-Threaded Asyncio Event Loop
The orchestrator is a single Python process running an asyncio event loop. All I/O (Discord bot, API calls, SQLite reads/writes, calendar API) is async. Concurrency comes from I/O multiplexing, not parallelism.
- SQLite access is serialized through a single async connection (aiosqlite) with WAL (Write-Ahead Logging) mode enabled. WAL allows concurrent reads with a single writer, which is exactly the access pattern: the orchestrator writes, and the Supabase sync process reads.
- Calendar write operations are serialized through an async queue. If two tasks need to create calendar events simultaneously, they are queued and processed sequentially. This prevents double-booking race conditions without complex locking.
- Task state transitions are atomic: read current state, validate transition, write new state, execute side effects --- all in a single async function with a SQLite transaction. No interleaving possible (see the sketch below).
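The atomic-transition pattern from the last bullet, sketched with aiosqlite; the table name and the VALID set are placeholders:

```python
import aiosqlite

VALID = {("scheduled", "in_progress"), ("in_progress", "done")}  # illustrative subset

async def transition(db: aiosqlite.Connection, task_id: str, new_state: str) -> None:
    """Atomic read-validate-write transition (§3.7.1 sketch). Runs in one
    transaction on the single writer connection, so no interleaving is possible."""
    await db.execute("BEGIN")
    try:
        async with db.execute("SELECT status FROM task WHERE id = ?", (task_id,)) as cur:
            row = await cur.fetchone()
        if row is None or (row[0], new_state) not in VALID:
            raise ValueError(f"invalid transition {row and row[0]} -> {new_state}")
        await db.execute("UPDATE task SET status = ? WHERE id = ?", (new_state, task_id))
        await db.commit()   # side effects (calendar, notifications) fire after commit
    except Exception:
        await db.rollback()
        raise
```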
3.7.2 Phase 3+: Task Queue with Worker Pool
When agents need to run in parallel (Coding Agent and Research Agent working on different tasks simultaneously), add a task queue (asyncio.Queue or a lightweight broker like arq backed by Redis). The orchestrator dispatches to the queue; worker processes pull tasks and execute independently. Shared state (task DB) is accessed through the orchestrator's internal API, not directly by workers. This prevents workers from making conflicting state changes.
Agent isolation: each agent worker is a separate Docker container (or at minimum a separate process) with its own tool access scope. Workers communicate with the orchestrator via an internal API, not by directly modifying shared state.
3.8 Schema Migration
All SQLite tables are defined using SQLAlchemy models. Alembic manages schema evolution for both the task database and the logging database. Migration files are version-controlled in alembic/versions/ in the repo.
- On orchestrator startup: run alembic upgrade head to apply any pending migrations. If the DB is fresh, this creates all tables. If existing, applies only new migrations (see the sketch after this list).
- Every schema change (new field, type change, enum expansion, index addition) gets its own migration file with upgrade() and downgrade() functions.
- Never modify existing migration files --- only add new ones.
- For the Supabase Postgres replica: maintain a parallel set of migrations or use the same Alembic config with a Postgres connection string. Schemas should match, but Postgres-specific features (Row Level Security) get their own migration files.
- Pre-migration backup: the Alembic migration runner automatically creates a SQLite backup before applying any migration. If the migration fails, the backup is the rollback path.
- Testing: before applying a migration to the production DB, test it against a copy. The backup strategy (Section 16) makes this easy.
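A sketch of the startup path from the first bullet, assuming a repo-root alembic.ini; the backup filename convention is illustrative:

```python
import shutil
from datetime import datetime
from alembic import command
from alembic.config import Config

DB_PATH = "/donna/db/donna_tasks.db"   # per the §3.5.3 layout

def migrate_on_startup() -> None:
    """Startup migration sketch (§3.8): snapshot the SQLite file, then apply
    pending migrations. The snapshot is the rollback path if upgrade fails."""
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    shutil.copy2(DB_PATH, f"/donna/backups/daily/pre-migration-{stamp}.db")
    cfg = Config("alembic.ini")          # Alembic config name is an assumption
    command.upgrade(cfg, "head")         # no-op when the schema is already current
```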
4. Model Abstraction & Evaluation Layer
The model layer is designed around a core principle: the orchestrator and agents never call a specific model provider directly. All LLM interactions go through a standardized interface that handles provider abstraction, structured logging, routing decisions, and shadow evaluation.
4.1 Model Interface
Every model call goes through a single function signature:
complete(prompt, schema, model_alias) → (response, metadata)
The metadata object always includes: latency_ms, tokens_in, tokens_out, cost_usd, model_actual (resolved provider + model name), and whether it was a shadow run. Two implementations exist behind this interface: AnthropicProvider (Claude API) and OllamaProvider (local LLM). A third provider can be added when needed without changing any calling code.
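The interface shape, sketched as a Python Protocol; the metadata fields follow the list above, while the exact class names are assumptions:

```python
from dataclasses import dataclass
from typing import Any, Protocol

@dataclass
class CallMetadata:
    latency_ms: int
    tokens_in: int
    tokens_out: int
    cost_usd: float
    model_actual: str       # resolved provider + model name
    is_shadow: bool = False

class ModelProvider(Protocol):
    """Shape of the §4.1 interface. AnthropicProvider and OllamaProvider both
    satisfy this protocol; a third provider slots in without caller changes."""
    async def complete(
        self, prompt: str, schema: dict[str, Any], model_alias: str
    ) -> tuple[dict[str, Any], CallMetadata]: ...
```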
4.2 Routing Configuration
The routing table maps model aliases to providers and defines per-task-type behavior. This is the primary configuration surface for controlling which model handles what work.
# donna_models.yaml
models:
  parser:
    provider: anthropic   # ollama once 3090 available
    model: claude-sonnet-4-20250514
  reasoner:
    provider: anthropic
    model: claude-sonnet-4-20250514
  fallback:
    provider: anthropic
    model: claude-sonnet-4-20250514
routing:
  task_parse:
    model: parser
    fallback: reasoner
    confidence_threshold: 0.7
  priority_classify:
    model: parser
    fallback: reasoner
    confidence_threshold: 0.7
  prep_research:
    model: reasoner
  code_generation:
    model: reasoner
  morning_digest:
    model: parser
    shadow: reasoner   # production monitoring: run secondary model, log only
During Phase 1 (Claude API only), all model aliases point to the Anthropic provider. When the local LLM becomes available, switching a task type to local requires changing only the provider and model fields for the relevant alias. The shadow key enables production monitoring (secondary model runs in parallel, output logged but not used). Offline evaluation is triggered via CLI with an explicit model argument, not configured in routing.
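A sketch of the alias resolution this config implies; fallback chains and confidence handling are omitted:

```python
import yaml  # pip install pyyaml

def resolve_model(config_path: str, task_type: str) -> tuple[str, str]:
    """Resolve task type -> alias -> (provider, model) from donna_models.yaml."""
    with open(config_path) as fh:
        cfg = yaml.safe_load(fh)
    alias = cfg["routing"][task_type]["model"]
    entry = cfg["models"][alias]
    return entry["provider"], entry["model"]

# resolve_model("donna_models.yaml", "task_parse")
# -> ("anthropic", "claude-sonnet-4-20250514") with the config above
```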
4.3 Structured Invocation Logging
Every model call is logged to the invocation_log table. This is the foundation for cost tracking, evaluation, and future routing optimization.
| Field | Type | Purpose |
|---|---|---|
| id | UUID | Unique invocation identifier |
| timestamp | DateTime | When the call was made |
| task_type | String | Which task type (parse, classify, generate, etc.) |
| task_id | UUID? | Associated task if applicable |
| model_alias | String | Config alias used (parser, reasoner, etc.) |
| model_actual | String | Resolved provider + model (anthropic/claude-sonnet-4-20250514) |
| input_hash | String | Hash of input for dedup and comparison matching |
| latency_ms | Int | Wall clock time for the call |
| tokens_in | Int | Input tokens consumed |
| tokens_out | Int | Output tokens generated |
| cost_usd | Float | Computed cost ($0.00 for local models before cost approx configured) |
| output | JSON | The actual structured response |
| quality_score | Float? | Nullable. Filled by spot-check batch job or offline eval |
| is_shadow | Boolean | Whether this was a shadow run (production monitoring) or eval run (offline comparison) |
| eval_session_id | UUID? | Groups invocations from a single evaluation run for session-based comparison |
| spot_check_queued | Boolean | Whether this invocation is queued for Claude-as-judge review |
| user_id | String | User who triggered the call |
4.4 Shadow Mode (Production Monitoring)
Shadow mode runs a secondary model on the same input in production without affecting the primary output. The primary model's response is used for all downstream processing; the shadow model's response is logged to the invocation_log for comparison. This is a runtime feature for ongoing quality monitoring after a task type has been migrated to a different model.
Use case: After migrating task_parse from Claude to a local model, keep Claude as a shadow for 2--4 weeks to monitor whether the local model's quality holds up on real production inputs. If shadow comparison shows quality degradation, revert the migration by changing the routing config.
Cost implication: Shadow mode doubles the model cost for that task type (two calls per input). It is intended as a temporary monitoring tool, not a permanent configuration. Disable once confidence is established.
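A sketch of how a shadow call can run concurrently without touching the primary path, reusing the provider shape sketched in §4.1; names are illustrative:

```python
import asyncio

async def call_with_shadow(primary, shadow, prompt, schema, log):
    """Shadow-mode sketch (§4.4): the primary response is returned and used
    downstream; the shadow call runs concurrently and is only logged."""
    shadow_task = asyncio.create_task(shadow.complete(prompt, schema, "shadow"))
    response, meta = await primary.complete(prompt, schema, "primary")
    try:
        s_response, s_meta = await shadow_task
        s_meta.is_shadow = True
        await log(s_response, s_meta)   # written to invocation_log, never used
    except Exception:
        pass                            # shadow failures must not affect the primary path
    await log(response, meta)
    return response, meta
```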
4.5 Offline Evaluation Harness (Model Comparison)
The evaluation harness is a development tool for comparing multiple models against the same test inputs. It is triggered via CLI, not part of production routing. Its primary purpose is model selection: determining which local LLM, quantization level, and parameter size best fits each task type on available hardware.
4.5.1 Tiered Test Fixtures
Fixtures are organized into complexity tiers. Each tier builds on the previous; if a model fails Tier 1, there is no need to continue to higher tiers. This saves time when evaluating models that clearly are not suitable. Fixtures are version-controlled in the repo so collaborators with different hardware run identical evaluations.
fixtures/
├── parse_task/
│ ├── tier1_baseline.json # ~10 cases: simple, unambiguous inputs
│ ├── tier2_nuance.json # ~15 cases: implicit deadlines, domain ambiguity
│ ├── tier3_complexity.json # ~10 cases: multi-part tasks, dependencies
│ └── tier4_adversarial.json # ~5 cases: edge cases, contradictions, buried tasks
├── classify_priority/
│ ├── tier1_baseline.json
│ └── tier2_nuance.json
├── generate_digest/
│ └── tier1_baseline.json # with quality rubrics for subjective evaluation
├── deduplication/
│ └── tier1_baseline.json # exact dupes, reformulations, related-but-distinct
├── escalation_awareness/ # cross-cutting evaluation dimension
│ ├── should_escalate.json # tasks the local model should NOT attempt
│ └── should_handle.json # tasks the local model should handle (false positive check)
└── instruction_following/ # cross-cutting evaluation dimension
├── claude_decomposition.json # Claude-generated subtask instructions
├── constraint_compliance.json # multi-constraint classification directives
└── correction_application.json # apply a learned correction rule to new inputs
4.5.2 Tier Definitions
| Tier | Name | Cases | Purpose | Pass Gate |
|---|---|---|---|---|
| 1 | Baseline | ~10 | Simple, unambiguous inputs any reasonable model should handle. "Buy milk," "pay electric bill by Friday." Quick pass/fail gate. | 90%+ accuracy to continue |
| 2 | Nuance | ~15 | Inputs requiring inference: implicit deadlines ("before the holidays"), domain ambiguity ("fix the leak"), priority signals ("this is urgent" vs "whenever"). | 80%+ accuracy |
| 3 | Complexity | ~10 | Multi-part tasks, dependency implications, tasks benefiting from tool use. "Refactor the auth module before the API launch next month." | 60%+ accuracy |
| 4 | Adversarial | ~5 | Edge cases: ambiguous inputs, contradictions, non-tasks that look like tasks, long freeform messages with a task buried in them. Tests graceful failure. | No gate --- diagnostic only |
Fixtures grow over time. When user corrections or spot-checks reveal an interesting failure case, that input/correction pair is added to the appropriate tier. The evaluation harness becomes more comprehensive as the system is used.
4.5.3 Sequential Evaluation (One Model at a Time)
A single GPU can only run one model at a time. Ollama loads models into VRAM on request and unloads when a different model is requested. The evaluation harness runs sequentially: load model A, run all fixtures through it, save results, then swap to model B and repeat. Models are not run in parallel.
Triggered via CLI:
donna eval --task-type task_parse --model ollama/llama3.1:8b-q4
This loads the specified model, runs it against all fixture tiers (stopping early if a tier fails its pass gate), and saves the results as a model session. To compare, run the command again with a different model at a later time. The comparison is across saved sessions, not simultaneous runs.
The typical workflow is: install the largest model that fits in VRAM, evaluate it, then try progressively smaller or differently quantized models to understand the quality/speed tradeoff for each task type.
4.5.4 Model Sessions
Each evaluation run is saved as a model session --- a record of which model was tested, when, on which hardware, and the results across tiers and dimensions. Sessions persist in the evaluation database so comparisons can be made across days or weeks as different models are tested. Collaborators with different hardware run the same fixtures and share their session results. The fixtures are the shared contract; the model selection and session data are per-environment.
4.5.5 What the Harness Answers
- Quantization tradeoffs: Does Q6 perform measurably better than Q4 for task parsing, or is the quality difference negligible for the latency cost?
- Parameter scaling: Is 8B sufficient for priority classification, or does a 13B model meaningfully improve accuracy on Tier 2--3 cases?
- Speed vs quality: Is a smaller, faster model (Phi-3 3.8B) adequate for simple classification tasks where only Tier 1 accuracy matters?
- Hardware fit: Which models fit within the VRAM constraints of different GPUs (8GB 1080, 12GB 4070, 24GB 3090) while still passing tier gates?
- Task-type specialization: Could a smaller model handle parsing (Tier 1--2 sufficient) while a larger model is reserved for complex reasoning (Tier 3--4 critical)?
- Multi-model coordination: Can the model recognize tasks beyond its capability and follow structured instructions from Claude effectively enough to participate in a hub-and-spoke workflow?
4.5.6 Evaluation Dimensions: Escalation Awareness & Instruction Following
In addition to the complexity tiers (which measure how well a model handles increasingly difficult tasks), two cross-cutting evaluation dimensions measure how well a model operates as a subordinate in a multi-model system. These are critical because Donna's architecture relies on the local model knowing its limits and taking direction from Claude.
Escalation Awareness --- "I shouldn't handle this."
This measures whether the model recognizes that a task is beyond its capability before producing output. This is distinct from confidence scoring, which measures uncertainty about the output after the model has already attempted it. Escalation awareness is about the model knowing when not to try.
The escalation fixtures contain two sets: should_escalate.json (tasks the local model should NOT attempt, e.g., multi-step research, code review requiring full codebase understanding, ambiguous requests requiring judgment) and should_handle.json (tasks the local model should handle confidently, checking for over-escalation).
| Metric | Definition | Target | Why This Threshold |
|---|---|---|---|
| Precision | Of tasks the model escalated, what % truly needed Claude? | 85%+ | Over-escalation wastes money but produces correct results. Tolerable. |
| Recall (caught tasks it shouldn't handle) | Of tasks that needed Claude, what % did the model flag? | 85%+ | Under-escalation produces garbage output. Less tolerable. Err toward escalating. |
| False positive rate | % of handleable tasks unnecessarily escalated | < 25% | Above this, the cost savings of local LLM are undermined. |
The asymmetry is deliberate: under-escalation (model attempts a task it can't handle) produces bad output that may go undetected. Over-escalation (model sends a simple task to Claude) costs more but produces correct results. This aligns with the safety-first principle --- better to spend an extra few cents than to silently produce wrong answers.
Instruction Following --- "Claude told me how to do this."
This measures the model's ability to operate as a subordinate in a multi-model chain. The realistic scenario: Claude decomposes a complex task into subtasks with specific, structured instructions, and the local model executes each subtask. Or Claude provides a correction with guidance, and the local model applies that guidance to new inputs.
The instruction-following fixtures include three categories: claude_decomposition.json (can the model execute subtasks as specified by Claude?), constraint_compliance.json (does the model apply all constraints or silently drop some?), and correction_application.json (given a learned correction rule, does the model apply it correctly and ignore it when irrelevant?).
| Metric | Definition | Target |
|---|---|---|
| Constraint compliance | Out of N constraints in the instruction, how many were satisfied? | 90%+ |
| Format adherence | Did the output match the requested schema/structure? | 95%+ |
| Rule application | When a correction rule applies, was it applied correctly? | 85%+ accuracy |
| Rule false application | When a correction rule does NOT apply, was it incorrectly triggered? | < 10% |
A model that scores well on complexity tiers but poorly on instruction following is useful for independent tasks but not for Claude-directed workflows. A model that scores well on instruction following but poorly on complexity is ideal as a Claude subordinate but should not be trusted for autonomous work. The evaluation harness reveals these profiles so routing decisions can be made accordingly.
4.5.7 Model Session Output
Model sessions include tier results and dimension scores, providing a complete profile of each model's capabilities:
Model Session: llama3.1-8b-q4 (2026-03-15, RTX 3090)
Task Parse Tiers:
Tier 1: 10/10 (100%) avg latency: 180ms
Tier 2: 12/15 (80%) avg latency: 340ms
Tier 3: 6/10 (60%) avg latency: 890ms
Escalation Awareness:
Precision: 88% Recall: 82% False positive: 18%
Instruction Following:
Constraint compliance: 91% Format adherence: 96%
Rule application: 87% Rule false application: 8%
---
Model Session: mistral-7b-q4 (2026-03-20, RTX 3090)
Task Parse Tiers:
Tier 1: 10/10 (100%) avg latency: 150ms
Tier 2: 11/15 (73%) avg latency: 290ms
Tier 3: 5/10 (50%) avg latency: 750ms
Escalation Awareness:
Precision: 92% Recall: 70% False positive: 12%
Instruction Following:
Constraint compliance: 84% Format adherence: 90%
Rule application: 79% Rule false application: 15%
This tells you: Llama 3.1 8B is a better Claude subordinate (higher instruction compliance, better recall on escalation). Mistral is more precise about when to escalate (fewer false positives) but misses more tasks it should escalate (lower recall) and is weaker at following multi-constraint directives. For a safety-first deployment, Llama's profile is preferable despite being slower.
4.6 Spot-Check Quality Monitoring
Spot-checks are periodic quality audits of production model outputs using Claude-as-judge. They are most valuable in Phase 3+ when the local LLM is handling production traffic. In Phase 1 (Claude-only), spot-checks are not active since Claude would be evaluating its own output.
4.6.1 Configuration
quality_monitoring:
  spot_check_rate: 0.05        # 5% of production calls sampled
  judge_model: reasoner        # which model evaluates
  judge_batch_schedule: weekly
  flag_threshold: 0.7          # scores below this are flagged
  enabled: false               # disabled in Phase 1, enable in Phase 3
The orchestrator rolls a random number on each production invocation. If the roll falls within the spot_check_rate, the output is queued for batch review. The batch job runs on the configured schedule and sends queued outputs to Claude with a judging prompt. Results are written back to the invocation_log's quality_score field.
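A sketch of the sampling roll; queueing details and the batch judge are elided, and the function name is hypothetical:

```python
import random

def maybe_queue_spot_check(invocation_id: str, cfg: dict, queue: list[str]) -> bool:
    """Per-invocation sampling roll (§4.6.1 sketch). cfg mirrors the
    quality_monitoring block; queued IDs are judged by the scheduled batch job."""
    if not cfg.get("enabled", False):
        return False
    if random.random() < cfg["spot_check_rate"]:
        queue.append(invocation_id)   # sets spot_check_queued in invocation_log
        return True
    return False
```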
4.6.2 Flagged Output Handling
When a spot-check scores below the flag_threshold, the system does not require a separate review UI. Instead, it creates a Donna task:
Task: \"3 low-quality parses flagged this week. Review and provide corrections.\"
Domain: work
Priority: 2
Notes: [links to relevant invocation_log entries]
The user reviews the flagged outputs, provides corrections, and those corrections flow into the correction log (Section 9). This keeps quality improvement in the same workflow as everything else --- Donna managing its own quality as a task in the system it manages.
4.6.3 Cadence Tuning
During early local LLM deployment, set spot_check_rate higher (0.10--0.20) to get fast signal on quality. As confidence builds and corrections decrease, dial it back to 0.02--0.05. The rate is a config value, adjustable at any time without code changes. If a model change is deployed, temporarily increase the rate to validate the new model's production quality.
4.7 Confidence Scoring
For routing decisions that depend on model confidence (e.g., falling back from local to Claude), two approaches are supported. Confidence scoring is relevant in Phase 3+ when the local LLM is handling production traffic.
- Self-assessed confidence (default): Include a confidence field (0.0--1.0) in the structured output schema. The model rates its own certainty. Simple to implement, effective for detecting "I don't know" cases, less reliable for "confidently wrong" detection.
- Logprob-based scoring (optional upgrade): When using Ollama, examine average token logprobs on structured output. Low confidence in tokens correlates with low confidence in the parse. Requires post-processing of the Ollama API response.
Recommendation: start with self-assessed confidence, log actual accuracy against test fixtures, and correlate the two. Move to logprob-based scoring only if self-assessment proves unreliable for specific task types.
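A sketch of the default self-assessed approach, assuming the output schema carries a confidence field and using the 0.7 threshold from the §4.2 routing config:

```python
async def parse_with_fallback(local, claude, prompt, schema, threshold: float = 0.7):
    """Self-assessed confidence routing (§4.7 sketch): below-threshold parses
    are redone on the fallback model. Provider objects follow the §4.1 shape."""
    response, meta = await local.complete(prompt, schema, "parser")
    if response.get("confidence", 0.0) < threshold:
        response, meta = await claude.complete(prompt, schema, "reasoner")
    return response, meta
```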
Relationship to escalation awareness: Confidence scoring and escalation awareness (Section 4.5.6) are complementary. Confidence scoring measures per-output uncertainty after the model has attempted a task. Escalation awareness measures whether the model recognizes a task is beyond its capability before attempting it. A well-functioning system uses both.
4.8 Local Model Cost Approximation
To enable meaningful cost comparison between local and cloud models, the config supports an estimated cost per 1K tokens for local models:
models:
  parser:
    provider: ollama
    model: llama3.1:8b-q4
    estimated_cost_per_1k_tokens: 0.0001   # approx from hardware amortization
This ensures the cost dashboard never shows local inference as "free" and enables genuine cost-per-quality analysis.
5. Task Management System
5.1 Task Schema
Every task is represented by the following schema. Fields marked auto-populated are inferred by the system; the user only needs to provide natural language input. The user_id field is included from day one to support future multi-user deployment.
| Field | Type | Source | Description |
|---|---|---|---|
| id | UUID | Auto | Unique task identifier |
| user_id | String | Auto | Owner of the task. Defaults to primary user. Enables future multi-user. |
| title | String | User/Inferred | Task title, extracted from natural language input |
| description | String | User/Agent | Detailed description. May be populated by PM agent interrogation. |
| domain | Enum | Inferred | personal \| work \| family (extensible) |
| priority | Int (1--5) | Inferred/User | 1 = lowest, 5 = critical. Auto-inferred from deadline proximity, keywords, domain. |
| status | Enum | Auto | backlog \| scheduled \| in_progress \| blocked \| waiting_input \| done \| cancelled |
| estimated_duration | Minutes | Inferred | How long the task will take. Inferred from complexity analysis. |
| deadline | DateTime? | User/Inferred | Hard deadline if specified. Null if flexible. |
| deadline_type | Enum | Inferred | hard \| soft \| none |
| scheduled_start | DateTime? | Scheduler | When the task is scheduled on the calendar |
| actual_start | DateTime? | Auto | When the user actually started |
| completed_at | DateTime? | Auto | Completion timestamp for velocity tracking |
| recurrence | Cron/RRULE? | User | Recurrence pattern if applicable |
| dependencies | UUID[] | User/Agent | Tasks that must complete before this one can start |
| parent_task | UUID? | Agent | Parent task if this is a subtask |
| prep_work_flag | Boolean | User | Whether prep work should be performed before scheduled time |
| prep_work_instructions | String? | User | What the assistant should prepare |
| agent_eligible | Boolean | Inferred/User | Whether this task can be delegated to a sub-agent |
| assigned_agent | String? | Orchestrator | Which agent is handling this task |
| agent_status | Enum? | Agent | pending \| gathering_requirements \| in_progress \| review \| complete \| failed |
| tags | String[] | User/Inferred | Freeform tags for filtering and grouping |
| notes | String[] | User/Agent | Running notes and context |
| reschedule_count | Int | Auto | How many times rescheduled. Triggers priority escalation. |
| created_at | DateTime | Auto | Creation timestamp |
| created_via | Enum | Auto | sms \| discord \| slack \| app \| email \| voice |
| estimated_cost | Float? | Auto | Estimated API cost if agent-eligible |
| calendar_event_id | String? | Auto | Google Calendar event ID for sync tracking |
| donna_managed | Boolean | Auto | Whether Donna created and manages this calendar event |
| nudge_count | Int | Auto | Nudges/reminders sent for this task. Drives escalation backoff (added v3.1). |
| quality_score | Float? | Auto | Spot-check quality rating from Claude-as-judge, when the spot-check monitor flags this task's output (added v3.1). |
| capability_name | String? | Orchestrator | Capability matched by the intent dispatcher (§23.2). Drives skill routing. Null for free-form tasks (added v3.1). |
| inputs_json | JSON? | Orchestrator | Extracted inputs dict matching the capability's input schema. Fed to the skill executor (added v3.1). |
5.2 Task Lifecycle State Machine
The task lifecycle is an explicit state machine defined in task_states.yaml. The orchestrator loads it at startup and rejects invalid transitions. Each transition specifies triggers and side effects. This prevents ad-hoc transition logic from scattering across the codebase and ensures consistent state handling. (A minimal loading sketch follows §5.2.2.)
5.2.1 Valid Transitions
| From | To | Trigger | Side Effects |
|---|---|---|---|
| backlog | scheduled | Scheduler assigns time slot | Calendar event created; calendar_event_id stored; donna_managed = true |
| scheduled | in_progress | User acknowledges start OR scheduled time arrives | actual_start timestamp set |
| scheduled | backlog | User cancels scheduled time, no new time requested | Calendar event deleted; reschedule_count++ |
| in_progress | done | User/agent reports completion | completed_at set; velocity metrics updated |
| in_progress | blocked | User/agent reports blocker | Dependencies updated; blocking reason logged; dependent tasks notified |
| in_progress | scheduled | User requests reschedule | New time slot assigned; reschedule_count++; calendar event updated |
| blocked | scheduled | Blocker resolved (dependency completed or user unblocks) | Scheduler finds next available slot |
| blocked | cancelled | User decides to abandon blocked task | Dependent tasks flagged for review |
| waiting_input | scheduled | User/agent provides required information | PM Agent updates task; scheduler assigns slot |
| waiting_input | cancelled | No response after configurable timeout (default 7 days) | User notified; task archived |
| any | cancelled | User explicitly cancels | Dependent tasks flagged; calendar event deleted if exists |
| done | in_progress | User reopens a completed task | completed_at cleared |
5.2.2 Invalid Transitions (Enforced at Orchestrator Level)
- backlog → done: Cannot complete without scheduling. Must go through scheduled → in_progress → done.
- cancelled → any state except backlog: Must be explicitly re-opened to backlog first.
- done → scheduled: Must go through in_progress first for reopening.
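A minimal loading-and-validation sketch for task_states.yaml; the YAML shape assumed here (a transitions list with from/to keys) is illustrative, not the actual file format:

```python
import yaml

class TaskStateMachine:
    """Loads the transition table at startup and rejects invalid transitions
    (§5.2 sketch)."""

    def __init__(self, path: str):
        with open(path) as fh:
            spec = yaml.safe_load(fh)
        # assumed shape: transitions: [{from: backlog, to: scheduled, ...}, ...]
        self.allowed = {(t["from"], t["to"]) for t in spec["transitions"]}

    def validate(self, current: str, target: str) -> None:
        if (current, target) not in self.allowed:
            raise ValueError(f"invalid transition: {current} -> {target}")
```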
5.3 Task Deduplication
Two-pass deduplication prevents duplicate task creation without blocking the capture pipeline.
5.3.1 Pass 1: Fuzzy Title Match
Uses rapidfuzz (Python library, fast C implementation) with token-sort ratio. This catches simple reformulations like "get oil change" vs "oil change needed." Cost: zero (local computation). Applied to all incoming tasks.
- Above 85% similarity: auto-flag as duplicate, ask user to confirm merge.
- Below 70% similarity: clearly different, no further check.
- 70--85% range: proceed to Pass 2 for LLM arbitration (sketched below).
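A sketch of the Pass 1 bucketing with rapidfuzz's token-sort ratio; thresholds as above, function name illustrative:

```python
from rapidfuzz import fuzz  # pip install rapidfuzz

def dedup_pass1(new_title: str, existing_title: str) -> str:
    """Pass 1 of §5.3: bucket a candidate pair into one of three outcomes
    using the local, zero-cost fuzzy score (0-100)."""
    score = fuzz.token_sort_ratio(new_title, existing_title)
    if score > 85:
        return "flag_duplicate"     # ask user to confirm merge
    if score < 70:
        return "distinct"           # no further check
    return "llm_arbitration"        # Pass 2: semantic comparison (§5.3.2)

# Example: dedup_pass1("get oil change", "oil change needed")
```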
5.3.2 Pass 2: LLM Semantic Comparison
For candidates in the 70--85% fuzzy range, send both task descriptions to the LLM with a structured prompt: "Are these the same task? Respond with: same (merge), related (link but keep separate), or different (no relation)." This catches "oil change for car" vs. "oil change for lawn mower" (different) and "send the invoice" vs. "email that invoice to the client" (same).
5.3.3 User Flow
When a duplicate is detected, the user is prompted on the same channel: "This looks like a duplicate of '[existing task title]' (created [date]). Should I merge them, keep both, or update the existing one?"
Dedup accuracy is tracked in evaluation fixtures (deduplication/tier1_baseline.json). Track false positive rate (incorrectly flagged as duplicate) and false negative rate (missed actual duplicate) over time.
5.4 Task Type Registry
Task types define how the system processes different categories of work. Each task type is a configuration entry specifying a prompt template, output schema, model assignment, and tool dependencies. Adding a new task type requires only config and (when a new tool is needed) a tool implementation --- no orchestrator code changes.
# task_types.yaml
task_types:
  parse_task:
    description: "Extract structured task fields from natural language"
    model: parser
    prompt_template: prompts/parse_task.md
    output_schema: schemas/task_parse_output.json
    tools: []
  classify_priority:
    description: "Assign priority 1-5 based on content and context"
    model: parser
    prompt_template: prompts/classify_priority.md
    output_schema: schemas/priority_output.json
    tools: [task_db_read]
  generate_digest:
    description: "Generate morning digest in Donna persona"
    model: parser
    shadow: reasoner
    prompt_template: prompts/morning_digest.md
    output_schema: schemas/digest_output.json
    tools: [calendar_read, task_db_read, cost_summary]
  prep_research:
    description: "Research and compile prep materials"
    model: reasoner
    prompt_template: prompts/prep_research.md
    output_schema: schemas/prep_output.json
    tools: [web_search, email_read, notes_read, fs_read]
Prompt templates are externalized as files, enabling per-model tuning. Different models may use different prompt formats (e.g., Llama 3.1 vs. Mistral), and templates can include few-shot examples that accumulate over time from the correction log.
Schema versioning: Output schemas use semantic versioning (e.g., task_parse_output_v2.json). When a schema changes, the orchestrator handles both old and new formats during the transition period.
Production task types (v3.1): The examples above are the original v3.0 entries. config/task_types.yaml also contains the following production task types:

| Task type | Model | Purpose |
|---|---|---|
| dedup_check | parser | Pass 2 semantic duplicate check (§5.3.2). Shadow: reasoner. |
| task_decompose | reasoner | Break a complex task into subtasks (PM Agent, §7.1.1). |
| extract_preferences | reasoner | Weekly batch extraction of correction patterns (§9.2). |
| generate_nudge | local_parser | Local LLM generates overdue nudge messages in persona. |
| generate_reminder | local_parser | Local LLM generates 15-minute pre-task reminders. |
| generate_weekly_digest | parser | Weekly efficiency + velocity digest (§11.1). |
| challenge_task | local_parser | Challenger Agent probes task quality and context (§7.1.1). Escalates unclassifiable tasks to Claude via the Novelty Judge. |
| classify_chat_intent | local_parser | Classifies Discord messages into {task, automation, question, chat} for the Conversation Engine (§24). |
All of the above also declare their tool allowlist; the orchestrator's ToolRegistry.dispatch() enforces it at invocation time and raises ToolNotAllowedError on violations (a sketch follows).
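A sketch of what that enforcement could look like; the real registry lives in the Skills subsystem (§23.3), and only the two names above come from the spec:

```python
class ToolNotAllowedError(Exception):
    """Raised when a task type invokes a tool outside its allowlist (§5.4)."""

class ToolRegistry:
    """Allowlist enforcement at dispatch time (illustrative shape)."""

    def __init__(self, tools: dict, allowlists: dict[str, list[str]]):
        self.tools = tools              # tool name -> async callable
        self.allowlists = allowlists    # task type -> permitted tool names

    async def dispatch(self, task_type: str, tool_name: str, **params):
        if tool_name not in self.allowlists.get(task_type, []):
            raise ToolNotAllowedError(f"{task_type} may not call {tool_name}")
        return await self.tools[tool_name](**params)
```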
5.5 Task Intelligence
5.5.1 Natural Language Task Parsing
When the user sends a message like "Get oil change before end of month," the input parser extracts:
- Title: "Get oil change"
- Deadline: End of current month (soft deadline)
- Domain: Personal (inferred from automotive context)
- Priority: 2 initially (flexible, no urgency keywords)
- Estimated duration: 60--90 minutes (inferred from task type)
5.5.2 Dynamic Priority Escalation
Priority is not static. The scheduler re-evaluates priority daily based on:
- Deadline proximity: As a soft deadline approaches, priority increments. A task due "end of month" starts at priority 2 but escalates to 4 by the last week. Implemented in scheduling/priority_engine.py via the _deadline_escalation() rule, using deadline_warning_days / deadline_critical_days from config/calendar.yaml (a minimal sketch follows this list).
- Reschedule count: Each reschedule feeds the _workload_pressure() rule together with the day's scheduled density. The v3.0 "+0.5 per reschedule" formula was not kept verbatim; the rule now floors priority higher on crowded days where the task has been moved repeatedly. A flag at 3+ reschedules is still emitted.
- Dependency chains: Not yet implemented. The priority engine does not currently walk downstream waiters. Deferred to Phase 6. The dependency-resolver module (scheduling/dependency_resolver.py) handles ordering for weekly planning but does not feed back into priority scoring.
- User override: Not yet implemented as a lock flag. The user can set priority manually, but there is no priority_locked column preventing subsequent auto-adjustment. Deferred to Phase 6.
- Learned preferences: The preference engine may apply priority adjustments based on patterns extracted from correction history (see Section 9). This hook is live and runs in InputParser.parse() after initial classification.
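A minimal sketch of the deadline rule referenced in the first bullet; the increment sizes here are illustrative, not the production values in scheduling/priority_engine.py:

```python
from datetime import datetime

def _deadline_escalation(priority: int, deadline: datetime | None,
                         warning_days: int = 7, critical_days: int = 2) -> int:
    """Deadline-pressure sketch (§5.5.2). Thresholds come from
    config/calendar.yaml; defaults here are placeholders."""
    if deadline is None:
        return priority
    days_left = (deadline - datetime.now()).days
    if days_left <= critical_days:
        return min(5, priority + 2)   # inside the critical window
    if days_left <= warning_days:
        return min(5, priority + 1)   # inside the warning window
    return priority
```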
5.5.3 Task Complexity Assessment
- Simple (< 30 min, no dependencies): Auto-schedule without interrogation. Examples: oil change, pay bill, send invoice.
- Medium (30 min--2 hours, may have dependencies): Schedule and optionally flag for prep work. Examples: research restaurants, draft email.
- Complex (2+ hours, likely has subtasks): Route to PM agent for interrogation and decomposition. Examples: refactor module, build feature, plan event.
5.6 Task Domains
| Domain | Scheduling Window | Priority Defaults | Notes |
|---|---|---|---|
| Personal | Evenings (5--8pm), Weekends | Standard (1--3) | Flexible scheduling, can fill gaps |
| Work | 8am--5pm weekdays (extends to 7pm if needed) | Standard to High (2--5) | Respects work calendar blocks |
| Family | Evenings, Weekends, Baby time blocks | High for child-related (3--5) | Never auto-deprioritize child care tasks |
6. Scheduling Engine
6.1 Calendar Integration
Google Calendar is the single source of truth for the user's time. The assistant has read-write access to the personal calendar and read access to work and family calendars. All three calendars are Google Calendar --- no ICS forwarding workarounds needed.
6.1.1 Calendar Sync Strategy
Donna uses a polling-based sync with change detection. The scheduler polls Google Calendar every 5 minutes (configurable). On each poll, it compares the current calendar state against its local mirror stored in SQLite.
Donna-managed events are tagged with Google Calendar extended properties:
extendedProperties.private:
  donnaManaged: "true"
  donnaTaskId: "<task-uuid>"
These are invisible to the user in the Calendar UI but readable by the API. This allows the system to distinguish its own events from user-created ones.
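For illustration, creating a tagged event with google-api-python-client looks roughly like this (`service` is an authorized Calendar v3 client; the event fields are placeholders):

```python
def create_donna_event(service, task_uuid: str) -> dict:
    """Insert a Donna-managed event carrying the private extended properties."""
    event_body = {
        "summary": "Get oil change",
        "start": {"dateTime": "2026-04-25T10:00:00-05:00"},
        "end": {"dateTime": "2026-04-25T11:30:00-05:00"},
        "extendedProperties": {
            "private": {"donnaManaged": "true", "donnaTaskId": task_uuid},
        },
    }
    return service.events().insert(calendarId="primary", body=event_body).execute()
```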
When the user modifies a Donna-managed event directly in Google Calendar:
- Time change detected: Treated as an implicit reschedule. Task's scheduled_start updated in SQLite. reschedule_count incremented. Logged as a correction (feeds preference learning, e.g., "Nick always moves morning tasks to afternoon"). No notification sent (user already knows, they made the change).
- Event deleted: Task moved back to backlog status. User notified on next interaction: "I noticed you removed the calendar event for [task]. Want me to reschedule it or leave it in your backlog?"
- User creates non-Donna event that conflicts: Donna yields to user-created events. Conflict resolution rules apply (Section 6.1.2). The Donna-managed event is auto-shifted to the next available slot.
6.1.2 Conflict Resolution Rules
| Conflict Type | Resolution | Notification |
|---|---|---|
| New meeting overlaps scheduled task | Auto-shift task to next available slot | None unless priority 4--5, then notify user |
| Two meeting invitations at same time | Flag user immediately with options | SMS or app notification |
| High-priority vs low-priority in same slot | Auto-replace; reschedule the lower-priority task | Include in daily digest |
| Task runs over estimated time | Auto-extend and cascade-shift subsequent tasks | Notify if it impacts a hard-deadline task |
| User cannot complete a task | Accept reschedule or auto-find next slot | Confirm new time via same channel |
v3.1 Implementation Status: Only the first and last rows are fully implemented. `scheduling/scheduler.find_next_slot()` detects overlap and returns the next available slot (sketched below); the state machine handles user-triggered rescheduling. The three middle rows (dual-invite disambiguation, priority displacement inside a slot, cascade-shifting on overrun) are not implemented and deferred to Phase 6. Conflicts that the simple algorithm cannot resolve are surfaced to the user via the standard notification channel.
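A sketch of the overlap walk behind `find_next_slot()` (interval representation, step size, and horizon handling are assumptions):

```python
from datetime import datetime, timedelta

def find_next_slot(duration: timedelta,
                   busy: list[tuple[datetime, datetime]],
                   start: datetime,
                   horizon: datetime,
                   step: timedelta = timedelta(minutes=15)) -> datetime | None:
    """Walk forward in `step` increments until a window of `duration` clears all busy intervals."""
    candidate = start
    while candidate + duration <= horizon:
        end = candidate + duration
        if all(end <= b_start or candidate >= b_end for b_start, b_end in busy):
            return candidate
        candidate += step
    return None  # no slot fits; surfaced to the user per the status note above
```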
6.2 Time Constraints
| Time Block | Hours | Task Types Allowed |
|---|---|---|
| Work | 8:00 AM -- 5:00 PM (weekdays) | Work domain tasks, meetings |
| Extended Work (weekdays, optional, not configured in v3.1) | 5:00 PM -- 7:00 PM | Work overflow, side projects |
| Personal Time | 5:00 PM -- 8:00 PM | Personal tasks, R&R, projects, study |
| Baby Time | Per calendar blocks | Family tasks only; never schedule other work |
| Food | Per calendar blocks | Protected; no tasks scheduled |
| Emergency Work (user-activated, not configured in v3.1) | 10:00 PM -- 12:00 AM | Only high-priority tasks; the user explicitly opens this window |
| Weekends | 6:00 AM -- 8:00 PM | Personal and family tasks. User reschedules freely. |
| Blackout (always) | 12:00 AM -- 6:00 AM | No scheduling, no notifications, no contact |
| Quiet Hours (default) | 8:00 PM -- 6:00 AM | No new scheduling. Urgent (priority 5) only. |
v3.1 Implementation Status: Seven of the nine windows above are live: Work, Personal Time, Weekends, Blackout, and Quiet Hours are configured in `config/calendar.yaml`, while Baby Time and Food come from the user's Google Calendar as non-Donna blocks. Extended Work and Emergency Work are defined in spec only; the YAML `time_windows` schema already supports them, and they can be enabled without code changes (see the sketch below).
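As a sketch of what enabling one of the deferred windows involves, assuming a `time_windows` entry deserializes into roughly this shape (key names are guesses mirroring `config/calendar.yaml`, not the actual schema):

```python
from datetime import datetime, time

# Hypothetical in-memory form of one time_windows entry.
extended_work = {
    "name": "extended_work",
    "days": {0, 1, 2, 3, 4},      # Mon-Fri
    "start": time(17, 0),
    "end": time(19, 0),
    "allowed_domains": {"work"},
    "enabled": False,              # flip to True to activate without code changes
}

def window_allows(window: dict, when: datetime, domain: str) -> bool:
    """True if this window is enabled and permits scheduling `domain` at `when`."""
    return (window["enabled"]
            and when.weekday() in window["days"]
            and window["start"] <= when.time() < window["end"]
            and domain in window["allowed_domains"])
```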
6.3 Scheduling Algorithm
- Weekly Planning (Monday mornings): Generate a proposed week plan. Present to user for review. Lock hard-deadline items first, then fill with flexible tasks.
- Daily Recalculation (6:00 AM): Recalculate task priorities based on the previous day's completion, new tasks, and calendar changes. (v3.1 note: this runs as `scheduling/priority_recalculator.py` and recalculates priorities only; full schedule rewrites happen on the weekly cadence via `weekly_planner.py`.)
- Real-time Adjustment: When a new task arrives or is rescheduled, re-evaluate only affected slots, not the entire week.
- Minimize Rescheduling: Prefer inserting new tasks into genuinely empty slots before displacing existing tasks. When displacement is necessary, move the lowest-priority, most-flexible task.
- Get It Done Bias: Default to scheduling tasks as soon as possible while respecting constraints. Do not push tasks to "someday."
7. Sub-Agent System
7.1 Agent Architecture
The sub-agent system follows a hierarchical structure with the Orchestrator at the top and specialized agents below. All agents communicate through a shared message bus and write outputs to the task database.
7.1.1 Agent Hierarchy
The Orchestrator (core process, not a sub-agent) receives all incoming tasks and determines routing. Runs on the local LLM when available, with Claude API fallback.
| Agent | Responsibilities | Tool Access | Autonomy Level |
|---|---|---|---|
| Scheduler Agent | Calendar management, time slot optimization, rescheduling, reminders, weekly planning | Google Calendar API (read-write), Task DB (read-write) | High --- auto-schedules priority 1--3 tasks |
| Research / Prep Agent | Web research, information compilation, resource gathering before flagged tasks | Web search (MCP), Gmail (read-only), Local filesystem (MCP read), GitHub (MCP read) | High --- runs autonomously when prep flagged. Results delivered via email. |
| Project Manager Agent | Task decomposition, requirements assessment, interrogation, work packaging for other agents | Task DB (read-write), all other agents (dispatch) | Medium --- can decompose and route, must confirm requirements with user before dispatching |
| Coding Agent | Code generation, file editing, project scaffolding | Local filesystem (MCP sandboxed read-write), GitHub (MCP read-write), Claude Code CLI | Low --- produces output for review. Never pushes to main. Never deletes without backup. |
| Communication / Drafting Agent | Email drafts, message drafts, document creation | Gmail (draft only; send behind feature flag), Docs/markdown (write), Discord/Slack (specific channels only) | Low --- always creates drafts. Never sends externally without explicit approval. |
| Challenger Agent (v3.1, Phase 3) | Pre-PM quality gate: probes new tasks for hidden context, missing success criteria, ambiguous scope. Emits {accept, needs_input, escalate_to_claude}. | Task DB (read), local LLM (classify_chat_intent, challenge_task) | Medium --- surfaces clarification questions and routes ambiguous tasks to the Novelty Judge. |
| Claude Novelty Judge (v3.1, Phase 3) | Judges tasks that Challenger cannot classify. Extracts structured intent and decides if the task is reusable as a new skill candidate. | Claude API (reasoner alias), Task DB (read) | Medium --- decides whether the task should become a new capability / skill candidate or be handled ad-hoc. |
v3.1 Implementation Status: Scheduler, Research/Prep, PM, Challenger, and Novelty Judge are enabled (see `config/agents.yaml`). Coding Agent and Communication / Drafting Agent are defined in config but `enabled: false` --- they are deferred to Phase 6 and gated on Stage 3 tool-use progression (§8.3) and the MCP rollout (§3.2.3). The safety constraints in §7.3 remain their intended contract.
7.2 Agent Execution Flow
- Orchestrator receives task. In v3.1 the first stop is the Challenger Agent, which probes for hidden context and returns one of {accept, needs_input, escalate_to_claude}. `needs_input` prompts the user directly; `escalate_to_claude` hands off to the Novelty Judge for capability matching (see the routing sketch after this list).
- Accepted tasks flow to the PM Agent for completeness assessment. If requirements are missing, it sends targeted questions (not open-ended). Example: "For the Module A refactor, I need to know: (1) which API endpoints are affected, and (2) should backward compatibility be maintained?"
- User responds. PM Agent updates the task with new information.
- PM Agent packages the task with full context, requirements, acceptance criteria, and file references.
- PM Agent dispatches to the appropriate execution agent. If the task matched a capability (§23.2), execution is handled by the Skills system; otherwise it goes to a sub-agent (Prep, Scheduler) or back to the user for manual action.
- Execution agent works. Progress logged to activity log.
- On completion, user receives summary via the same channel (typically Discord) and output is available for review.
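A condensed sketch of the step-1 routing (verdict names from §7.1.1; the handler interfaces are illustrative):

```python
async def route_new_task(task, challenger, novelty_judge, pm_agent, user):
    """First stop for every incoming task: the Challenger's quality gate."""
    verdict = await challenger.challenge(task)   # {accept, needs_input, escalate_to_claude}
    if verdict.kind == "needs_input":
        await user.ask(verdict.questions)        # prompt the user directly
    elif verdict.kind == "escalate_to_claude":
        await novelty_judge.classify(task)       # capability matching (§23.2)
    else:  # "accept"
        await pm_agent.assess(task)              # completeness assessment (step 2)
```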
7.3 Agent Safety Constraints
These constraints are non-negotiable and enforced at the system level, not reliant on agent prompting:
| Constraint | Enforcement |
|---|---|
| No sending emails to external addresses | Gmail API scoped to draft-only by default. Send scope gated behind feature flag (disabled by default). Enabling requires config change + OAuth re-authorization with broader scope. |
| No deleting files | Filesystem access is append/modify only. Deletes require explicit user command through UI. |
| No pushing to main/production branches | GitHub API restricts push to feature branches. Branch protection at GitHub level. |
| No external purchases or financial transactions | No payment APIs integrated. No browser automation for e-commerce. |
| No modifying manually-created calendar events | Scheduling agent only creates/modifies events tagged as donnaManaged: true. |
| Backup before code changes | Coding agent creates git stash or branch backup before any file modification. |
| Agent timeout enforcement | Each invocation has a configurable timeout (default 10 min coding, 5 min research). Timeout triggers user notification. |
Safety-first principle: All agents start with minimal autonomy. Constraints are relaxed only after reviewing logged agent performance and explicitly updating the agent's configuration. The system errs on the side of requiring user confirmation rather than acting autonomously.
Template-write autonomy (slice 15): The safety envelope for autonomous vault writes is layered on top of the per-agent `autonomy` field in `config/agents.yaml`. Each template-write skill carries its own `autonomy_level` under `memory.skills.<skill>` in `config/memory.yaml` --- a skill-local value that can differ from the owning agent's setting (per-template beats per-agent, so different templates under the same agent can demand different levels of human review). When `autonomy_level == "low"`, `MemoryInformedWriter` rewrites the caller-computed `target_path` to `Inbox/{basename}` before calling `VaultWriter.write`, forcing the output into the user's review queue. `medium` and `high` honor the caller's path. See §30.8 for the full flow.
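The low-autonomy case reduces to a path substitution before the write; roughly (names from §30, surrounding control flow assumed):

```python
from pathlib import PurePosixPath

def resolve_target_path(target_path: str, autonomy_level: str) -> str:
    """Force low-autonomy template writes into the user's review queue."""
    if autonomy_level == "low":
        return f"Inbox/{PurePosixPath(target_path).name}"  # Inbox/{basename}
    return target_path  # medium and high honor the caller's path
```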
8. Local LLM Tool Use Progression
This section defines the phased approach to expanding local LLM capabilities beyond text-in/text-out processing. This work begins after the RTX 3090 is acquired and the local model has been validated on basic parsing tasks. Each stage requires passing evaluation thresholds before progressing.
8.1 Stage 1: Read-Only Tools, Single Call
Timeline: First month after local LLM deployment
Tools available: task_db_read, calendar_read
Purpose: Context enrichment during parsing. Examples: checking if a task already exists (deduplication), resolving "before my meeting" to an actual time by reading the calendar.
Evaluation: Use the offline evaluation harness (Section 4.5) to validate tool use accuracy against test fixtures. Additionally, enable shadow mode (Section 4.4) with Claude as the shadow to monitor production quality. Measure: did the model call the right tool with the right parameters? Did it incorporate the result correctly? Log every invocation.
Promotion threshold: 90%+ accuracy on tool selection and parameter correctness over 100+ samples.
8.2 Stage 2: Conditional Tool Use
Timeline: Second month after local LLM deployment
Challenge: The model must decide whether to use a tool, not just use it correctly. Input "buy milk" needs no tool call; "buy milk before my 3pm meeting" needs calendar_read.
Evaluation: Log every case where the model calls a tool unnecessarily or fails to call one when needed. Both false positive and false negative tool calls are tracked.
Promotion threshold: 85%+ precision and recall on tool use decisions over 100+ samples.
8.3 Stage 3: Write Tools with Guardrails
Timeline: Third month, only if Stage 2 performance is solid
Tools available: task_db_write (create tasks directly)
Guardrails: The model proposes a write operation; the orchestrator validates against the task schema before executing. Malformed entries are rejected and logged. The model never writes to calendar or triggers notifications directly --- those always go through the orchestrator's validation layer.
Evaluation: Compare model-proposed task entries against what a human (or Claude) would have created from the same input.
8.4 Tool Execution Architecture
The model never directly calls tools (whether MCP or internal API). The flow is: model outputs a tool call request → orchestrator validates the request (is this tool allowed for this task type? are the parameters well-formed?) → orchestrator executes via the appropriate integration module → result is fed back to model. This validation layer is model-agnostic --- the same path whether the request comes from the local LLM or Claude.
Tool access per task type is defined in the task type registry (Section 5.4). A task type configured with tools: [calendar_read] cannot result in a task_db_write call, regardless of what the model requests. This is enforced at the orchestrator level.
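A sketch of that registry-level enforcement (`ToolRegistry.dispatch()` and `ToolNotAllowedError` are named in §5.4's registry notes; the body here is illustrative):

```python
class ToolNotAllowedError(Exception):
    """Raised when a task type requests a tool outside its allowlist."""

class ToolRegistry:
    def __init__(self, tools: dict, allowlists: dict[str, set[str]]):
        self._tools = tools          # tool name -> async callable (integration module)
        self._allow = allowlists     # task type -> permitted tool names (task_types.yaml)

    async def dispatch(self, task_type: str, tool: str, params: dict):
        if tool not in self._allow.get(task_type, set()):
            raise ToolNotAllowedError(f"{task_type!r} may not call {tool!r}")
        return await self._tools[tool](**params)  # parameter schema checks omitted here
```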
9. User Preference Learning
The preference learning system adapts to user behavior without model fine-tuning. It operates by logging corrections, extracting patterns from those corrections, and applying learned rules to future processing. All learned preferences are transparent, editable, and reversible.
9.1 Correction Logging
When the user corrects a system output (e.g., changes a task's domain, priority, or scheduled time), the correction is logged:
| Field | Type | Description |
|---|---|---|
| id | UUID | Unique correction identifier |
| timestamp | DateTime | When the correction was made |
| user_id | String | Who made the correction |
| task_type | String | Which task type was wrong (e.g., parse_task) |
| task_id | UUID | The specific task that was corrected |
| input_text | String | Original natural language input |
| field_corrected | String | Which field was changed (domain, priority, etc.) |
| original_value | String | What the system produced |
| corrected_value | String | What the user changed it to |
| rule_extracted | UUID? | Link to extracted rule, if one was created |
9.2 Rule Extraction
Rule extraction runs on a configurable schedule (default: weekly) or on demand. It batches recent corrections and sends them to Claude API for pattern analysis. Claude identifies recurring patterns and outputs structured rules.
Example extracted rule:
    {
      "rule": "Tasks mentioning vehicle/car/automotive → domain: personal",
      "confidence": 0.9,
      "supporting_corrections": ["uuid1", "uuid3", "uuid7"],
      "rule_type": "domain_override",
      "condition": {"keywords": ["car", "oil change", "tire", "vehicle"]},
      "action": {"field": "domain", "value": "personal"}
    }
9.3 Learnable Preference Types
- Domain overrides: Keyword-based rules mapping task content to domains. Highly reliable, accumulate quickly. "Anything about cars is always personal."
- Priority adjustments: Source-based or entity-based rules. "Tasks from [boss name] are always priority 4 minimum."
- Scheduling preferences: Extracted from reschedule patterns. "Nick never does deep work before 10am." "Nick always reschedules Friday afternoon tasks to Monday."
- Notification preferences: Extracted from response patterns. "Nick ignores app notifications but responds to SMS within 10 minutes."
- Few-shot example accumulation: Well-handled corrections become few-shot examples in prompt templates. The prompt_template config supports an examples_file field pointing to a JSON file of labeled examples that gets prepended to the prompt.
9.4 Preference Application
Preferences are applied after initial model processing as a post-processing step. The model's output is the first draft; the preference engine is the editor.
Application order: model produces structured output → preference engine checks applicable rules → matching rules override relevant fields → orchestrator uses the final output for scheduling/routing.
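As a sketch, the editor pass is a fold of matching rules over the model's draft (rule fields follow the §9.2 example; keyword matching is shown as the simplest condition type):

```python
def apply_preferences(parsed: dict, rules: list[dict]) -> dict:
    """Post-process model output: matching rules override individual fields."""
    text = parsed.get("input_text", "").lower()
    for rule in rules:
        if rule.get("disabled"):
            continue  # auto-disabled or user-disabled rules never fire
        keywords = rule["condition"].get("keywords", [])
        if any(kw in text for kw in keywords):
            action = rule["action"]
            parsed[action["field"]] = action["value"]  # e.g. domain -> "personal"
    return parsed
```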
9.5 Transparency & Control
All learned preferences are stored as readable, editable entries. The user can view, edit, disable, or delete any preference at any time. Example display:
Active Preferences:
1. Car/vehicle tasks → domain: personal (learned from 5 corrections)
2. Tasks from [boss] → priority: 4 minimum (learned from 3 corrections)
3. Never schedule personal tasks before 10am (learned from 8 reschedules)
[edit] [disable] [delete]
This transparency is a deliberate design choice. The system adapts in a way that is inspectable and reversible, which builds trust over time. If a rule causes corrections in the opposite direction, it is auto-disabled and flagged for user review.
10. Input Channels & Task Capture
The task capture system must be frictionless. The user's primary failure mode is not writing tasks down, so every channel must accept natural language with zero required structure.
10.1 Input Channel Matrix
| Channel | Implementation | Cost | Priority |
|---|---|---|---|
| Discord Bot | Bot in dedicated server/channel. discord.py with message intents. Self-hosted on Linux server. | Free (self-hosted) | P0 --- cross-device, already installed |
| SMS / Text | Twilio number. Parsed by LLM. | $1--2/mo (Twilio) | P0 --- fastest capture |
| Desktop App (Chat) | Flutter desktop. WebSocket to orchestrator. | Free (self-hosted) | P1 --- primary workstation interface |
| Web/Mobile App | Flutter web/PWA hosted on Firebase. | Firebase free tier | P1 --- mobile access |
| Email Forwarding | Dedicated email alias. Forwarded emails parsed for tasks. | Free | P2 --- capture from email threads |
10.2 Discord Integration Detail
A dedicated Donna category in the existing Linux server alert Discord server with multiple channels:
- donna-tasks: Task capture and responses. Multi-turn PM Agent interrogations use Discord threads on the original task message. Thread ID provides natural context association.
- donna-digest: Morning and evening digests. Clean, chronological record.
- donna-agents: Agent activity notifications, completion summaries, cost per task.
- donna-debug: System health alerts, cost warnings, error notifications, circuit breaker status.
A full bot (not just webhooks) is required for bidirectional communication --- Donna sends messages AND reads user responses. Use discord.py with the message intent enabled so the bot can read replies and thread responses.
Discord's 2000-character message limit is handled via message splitting or embeds (which support richer formatting and up to 6000 characters across fields). Morning digests use embeds for structured presentation.
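A minimal splitter for the plain-text fallback (embed packing is more involved; this only respects the 2000-character ceiling and assumes no single line exceeds it):

```python
def split_message(text: str, limit: int = 2000) -> list[str]:
    """Split on line boundaries so no chunk exceeds Discord's message limit."""
    chunks, current = [], ""
    for line in text.splitlines(keepends=True):
        if len(current) + len(line) > limit:
            chunks.append(current)
            current = ""
        current += line
    if current:
        chunks.append(current)
    return chunks
```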
10.3 Conversation Context Management
Multi-turn interactions (PM Agent interrogation, clarification requests) require context tracking across messages. The approach differs by channel.
10.3.1 Discord: Thread-Based Context
When an agent needs follow-up information, it opens a Discord thread on the original task message. The user replies in-thread, and the bot associates responses by thread ID. No custom conversation context store needed for this channel. Threads provide natural grouping, history, and context.
10.3.2 SMS/Email: Conversation Context Store
SMS (Twilio) has no thread concept. When Donna sends a PM Agent question via SMS and the user responds hours later, the system needs to route that response correctly. A conversation_context table in SQLite tracks active interrogations:
| Field | Type | Description |
|---|---|---|
| id | UUID | Context identifier |
| user_id | String | User being interrogated |
| channel | Enum | sms \| email \| slack |
| task_id | UUID | Task being interrogated |
| agent_id | String | Which agent initiated the interrogation |
| questions_asked | JSON | Array of questions sent to user |
| responses_received | JSON | Array of responses received |
| status | Enum | active \| expired \| completed |
| created_at | DateTime | When interrogation started |
| expires_at | DateTime | Default: 24 hours from creation |
| last_activity | DateTime | Last message sent or received |
Routing logic for incoming SMS messages:
- Check: is there an active conversation context for this user on the SMS channel?
- If yes: route the message to that context's agent for processing.
- If multiple active contexts exist (rare but possible): ask the user to disambiguate: "I have questions about two tasks. Which are you responding about: (1) [task A title] or (2) [task B title]?"
- If no active context: treat the message as new task input (normal parsing pipeline).
Contexts expire after 24 hours of inactivity. On expiration, the agent re-prompts: "Hey, I still need info about [task]. [original question]." This re-prompt creates a new context with a fresh TTL.
For email: use email threading (In-Reply-To headers). If the user replies to a Donna email, the threading metadata maps directly to the originating task.
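A sketch of the SMS routing decision against the `conversation_context` table (handler callables are injected; the query shape assumes aiosqlite):

```python
async def route_incoming_sms(db, user_id: str, body: str, *,
                             parse_new_task, ask_disambiguation, dispatch):
    """Route an inbound SMS per §10.3.2."""
    cursor = await db.execute(
        "SELECT id, agent_id, task_id FROM conversation_context "
        "WHERE user_id = ? AND channel = 'sms' AND status = 'active'",
        (user_id,),
    )
    contexts = await cursor.fetchall()
    if not contexts:
        return await parse_new_task(user_id, body)          # normal parsing pipeline
    if len(contexts) > 1:
        return await ask_disambiguation(user_id, contexts)  # "Which task are you answering?"
    return await dispatch(contexts[0], body)                # single active interrogation
```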
10.4 Input Parsing Pipeline
- Receive raw text from input channel with metadata (source, timestamp, user context).
- LLM parses input into structured task fields (title, deadline, domain, priority, etc.).
- Preference engine applies learned rules to override/adjust parsed fields.
- Deduplication check against existing tasks (Section 5.3). If duplicate detected, notify user and ask for merge/update.
- Complexity assessment. Simple tasks auto-scheduled; complex tasks routed for interrogation.
- Confirmation message sent back on same channel: "Got it. 'Oil change' scheduled for Saturday 10am. Priority 2."
10.5 Proactive Task Capture
- End-of-meeting prompt: If calendar shows a meeting just ended: "Your standup just ended. Any new tasks or action items?"
- Evening check-in: At configurable time (e.g., 7pm): "Anything you need to capture before tomorrow?"
- Stale task detection: If a task has been in backlog 7+ days with no scheduled time: "This has been sitting unscheduled for a week. Should I schedule it or archive it?"
11. Notification & Escalation System
11.1 Notification Types & Channels
| Notification Type | Channel | Timing | Content |
|---|---|---|---|
| Morning Digest | Discord (#donna-digest) | 6:30 AM daily | Full day schedule, task list, prep results, agent activity, carry-overs, system health summary. (v3.0 targeted email; production delivery is Discord.) |
| Task Reminders | Discord | 15 min before start | Task name, duration, prep materials available |
| Overdue Nudge | Discord → SMS → Email (ladder) | 30 min after scheduled end | Direct question: finish or reschedule? Escalates via the tiers in §11.2. |
| Agent Interrogation | Discord | When PM/Challenger needs info | Specific targeted questions with context |
| Agent Completion | Discord | When agent finishes | Summary, output location, cost |
| End-of-Day Digest | Discord (#donna-digest) | 5:30 PM weekdays | Completed, rescheduled, agent activity, daily cost |
| Weekly Digest (v3.1) | Discord (#donna-digest) | Monday 8 AM | Velocity, completion ratio, priority drift, cost summary |
| Budget Alert | SMS + Discord | Daily $20 threshold or 90% monthly | Spend breakdown, recommendation to continue/pause |
| Conflict Alert (not implemented in v3.1) | -- | Immediately on detection | Description, proposed resolution options |
| Urgent Phone Call Escalation (Tier 4, disabled by default) | Phone call (TTS) | Critical deadline miss or system failure | Brief TTS message via Twilio with callback option. tier4_enabled: false by default. |
v3.1 Delivery Channels: v3.0 assumed most digests/interrogations would go over email. In production, Discord is the primary channel for everything except SMS escalation (Tier 2), outbound budget alerts (SMS), and the email escalation (Tier 3, which uses Gmail draft mode when the send flag is off).
11.2 Escalation Tiers
- Tier 1 --- App notification / Discord message. Wait 30 minutes.
- Tier 2 --- SMS text message. Wait 1 hour.
- Tier 3 --- Email with "ACTION REQUIRED" subject. Wait 2 hours.
- Tier 4 --- Phone call (TTS). Only for priority 5 tasks or budget emergencies. Maximum 1 call per day.
Escalation resets when the user acknowledges any message on any channel. If the user responds "busy, will handle later," the system backs off for 2 hours.
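The ladder reduces to an ordered list of (channel, wait) pairs; a sketch with the timings above (loop structure and acknowledgment interface are assumptions):

```python
ESCALATION_LADDER = [
    ("discord", 30 * 60),     # Tier 1: wait 30 minutes
    ("sms", 60 * 60),         # Tier 2: wait 1 hour
    ("email", 2 * 60 * 60),   # Tier 3: wait 2 hours
    ("phone", None),          # Tier 4: priority-5 / budget emergencies only, max 1 call/day
]

async def escalate(task, send, acknowledged_within):
    """Walk the ladder until the user acknowledges on any channel."""
    for channel, wait in ESCALATION_LADDER:
        if channel == "phone" and not getattr(task, "tier4_eligible", False):
            return  # Tier 4 stays off unless explicitly enabled
        await send(channel, task)
        if wait is None or await acknowledged_within(task, seconds=wait):
            return  # acknowledgment on any channel resets the ladder
```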
12. Service Integrations
12.1 Integration Matrix
| Service | Access Level | Integration Pattern | Tools / Methods |
|---|---|---|---|
| Gmail | Read-only (send behind feature flag) | Direct API (google-api-python-client) [IMPLEMENTED] | email_read, email_search, draft_create |
| Google Calendar | Read-Write (personal); Read (work, family) | Direct API (google-api-python-client) [IMPLEMENTED] | calendar_read, calendar_write, calendar_delete |
| GitHub | Read-Write (feature branches only) | MCP (FastMCP) [DEFERRED Phase 6] | github_read, github_write, github_issues |
| Notes (Local Markdown) | Read-Write | MCP (FastMCP) [DEFERRED Phase 6] | notes_read, notes_write |
| Local Filesystem | Read-Write (sandboxed to /donna/workspace/) | MCP (FastMCP) [DEFERRED Phase 6] | fs_read, fs_write, fs_list |
| Discord | Read-Write (Donna channels only) | Direct API (discord.py) [IMPLEMENTED] | discord_send, discord_read, thread management |
| Twilio (SMS/Voice) | Write (outbound only; Tier 4 voice disabled by default) | Direct API (twilio-python) [IMPLEMENTED; voice disabled by default] | sms_send, phone_call |
| Web Search | Read | web_fetch local tool [PARTIAL]; SearXNG / MCP search still deferred | Simple HTTP fetch |
| SQLite Task DB | Read-Write | Direct API (aiosqlite) [IMPLEMENTED] | Internal orchestrator access, no MCP overhead |
| Supabase | Write (sync replica) | Direct API (supabase-py) [IMPLEMENTED] | Background Postgres write-through sync |
13. Cost Management & Monitoring
13.1 Budget Rules
| Rule | Threshold | Action |
|---|---|---|
| Daily Spend Alert | $20 (20% of monthly budget) | Pause all autonomous agent work. Notify user via SMS with progress summary. |
| Task Cost Notification | Estimated cost > $5 for a single task | Notify user before execution. Require approval to proceed. |
| Monthly Warning | 90% of $100 monthly budget | Pause work. Send detailed report. Ask for budget increase approval. |
| Budget Increase Approved | User approves additional funds | Increase for current month only. Reset to $100 on the 1st. |
| Budget Increase Denied | User denies | Remain paused on agent work. Continue local LLM + scheduling (zero API cost when available). |
13.2 Cost Tracking
Every Claude API call is tracked via the invocation_log (Section 4.3). Aggregated metrics include: cost per agent, cost per task, cost per task type, daily/weekly/monthly totals, and projected monthly spend based on current velocity.
Cost optimization loop: Weekly, the system reviews API calls and identifies patterns that could be offloaded to the local LLM. Shadow mode comparison data directly feeds this analysis.
13.3 Phase 1 Cost Projection
During Phase 1 (Claude API only, no local LLM), all parsing and classification runs on Claude. Projected costs based on typical usage:
| Operation | Daily Volume (est.) | Tokens/call (est.) | Daily Cost (est.) |
|---|---|---|---|
| Task parsing | 10--20 tasks | ~500 in / ~200 out | $0.10--$0.30 |
| Priority classification | 10--20 tasks | ~300 in / ~100 out | $0.05--$0.15 |
| Morning digest | 1 | ~2000 in / ~500 out | $0.02 |
| Prep work research | 2--5 tasks | ~3000 in / ~1000 out | $0.30--$0.75 |
| Agent work (Phase 2+) | Variable | Variable | $1--$5 |
Phase 1 daily cost (no agents): approximately $0.50--$1.20/day, or $15--$36/month. Well within the $100 budget with substantial headroom for agent work in later phases.
14. Observability & Logging Architecture
Observability is a Phase 1 deliverable, not an afterthought. Every Donna service emits structured JSON logs to a centralized pipeline. A dedicated logging database and searchable dashboard ensure that debugging any issue requires seconds, not hours. The goal: no issue should ever require SSH and grep.
14.1 Logging Framework
All Python services use structlog with JSON output and contextvars for async context propagation. Every incoming request binds correlation_id, user_id, channel, and task_id as context variables that automatically appear in all downstream log entries. This enables full request tracing across services.
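For illustration, the binding at the top of a request handler looks like this with structlog's contextvars support:

```python
import uuid
import structlog

structlog.configure(
    processors=[
        structlog.contextvars.merge_contextvars,  # fold bound context into every entry
        structlog.processors.add_log_level,
        structlog.processors.TimeStamper(fmt="iso", utc=True),
        structlog.processors.JSONRenderer(),
    ],
)
log = structlog.get_logger()

def bind_request_context(user_id: str, channel: str, task_id: str | None = None) -> None:
    """Bind per-request fields once; they appear on all downstream log entries."""
    structlog.contextvars.clear_contextvars()
    structlog.contextvars.bind_contextvars(
        correlation_id=uuid.uuid4().hex,
        user_id=user_id,
        channel=channel,
        task_id=task_id,
    )

bind_request_context("nick", "discord")
log.info("task.received")  # JSON entry includes correlation_id, user_id, channel, task_id
```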
14.2 Log Levels
| Level | When to Use | Examples |
|---|---|---|
| DEBUG | Detailed diagnostics. Off in prod unless troubleshooting. | Full prompt contents, API response bodies, dedup similarity scores, scheduler slot evaluation steps, preference rule matching details |
| INFO | Normal operations. The system is working correctly. | Task created, state transitioned, reminder sent, digest generated, calendar synced, agent dispatched, backup completed |
| WARNING | Something unexpected, but the system handled it. | API retry triggered, confidence below threshold, reschedule count high, preference rule auto-disabled, degraded mode activated |
| ERROR | An operation failed but the system continues running. | API call failed after all retries, schema validation rejected, agent timed out, malformed user input couldn't be parsed |
| CRITICAL | System-level failure requiring immediate attention. | Circuit breaker activated, database corruption detected, orchestrator crash, NVMe space exhausted, all retries exhausted on critical path |
14.3 Logging Database
v3.1 Implementation Status: The standalone `donna_logs.db` described in v3.0 was not built. Per-invocation structured data lives in the `invocation_log` table inside `donna_tasks.db` (WAL-mode SQLite); all other service logs stream over stdout → Docker json-file → Promtail → Loki, and are queried via Grafana. The table schema in §14.3.1 below is kept for reference; production `invocation_log` columns are listed after the spec table.
Dedicated SQLite database (donna_logs.db) on NVMe, target design: separate from the task database to avoid contention between high-volume log writes and task query performance.
14.3.1 Log Table Schema (design target)
| Field | Type | Purpose |
|---|---|---|
| id | INTEGER PK | Auto-incrementing row ID |
| timestamp | TEXT ISO 8601 | When the event occurred (UTC) |
| level | TEXT | DEBUG, INFO, WARNING, ERROR, CRITICAL |
| service | TEXT | Which service emitted: orchestrator, mcp_server, discord_bot, scheduler, notification, agent_worker, sync |
| component | TEXT | Sub-component: input_parser, calendar_sync, state_machine, preference_engine, etc. |
| event_type | TEXT | Machine-readable event name (e.g., task.created, api.call.failed, agent.timeout) |
| message | TEXT | Human-readable log message |
| correlation_id | TEXT | Unique ID for tracing a single request/task across all services and log entries |
| task_id | TEXT? | Associated task UUID |
| user_id | TEXT? | User who triggered the action |
| agent_id | TEXT? | Agent type if emitted from agent worker |
| channel | TEXT? | discord, sms, email, app, system |
| duration_ms | INTEGER? | Duration for timed operations (API calls, agent runs, scheduling cycles) |
| cost_usd | REAL? | API cost if this is a model call |
| error_type | TEXT? | Exception class name for ERROR/CRITICAL |
| error_trace | TEXT? | Full Python stack trace for ERROR/CRITICAL |
| extra | TEXT (JSON) | Arbitrary additional structured context |
Indexes on: timestamp, level, service, event_type, correlation_id, task_id, error_type. WAL mode enabled for concurrent read/write.
Production `invocation_log` columns (v3.1): `id`, `timestamp`, `task_type`, `task_id`, `user_id`, `model_alias`, `model_actual`, `input_hash`, `latency_ms`, `tokens_in`, `tokens_out`, `cost_usd`, `output` (JSON), `quality_score`, `is_shadow`, `eval_session_id`, `spot_check_queued`, plus the v3.1 additions `queue_wait_ms` (time in the LLM Gateway queue, §26), `interrupted` (whether a higher-priority request preempted this call), `chain_id` (groups multi-step skill chains), `caller` (free-form tag: skill, agent, orchestrator), `estimated_tokens_in`, `overflow_escalated` (whether a local-LLM call escalated to Claude), and `skill_id` (FK to skill when emitted from a skill step). Free-form service logs (info/warning/error messages, event_type breadcrumbs) never reach SQLite and are queried in Loki via Grafana.
14.3.2 Retention Policy
- DEBUG logs: 7 days retention (high volume, diagnostic only).
- INFO logs: 30 days retention.
- WARNING logs: 90 days retention.
- ERROR and CRITICAL logs: 1 year retention (never auto-deleted during that period).
- Invocation logs (Section 4.3): permanent retention (needed for cost analysis, evaluation, and preference learning).
A nightly cron job prunes expired logs based on level. VACUUM runs weekly to reclaim disk space.
v3.1 Status: Because free-form logs live in Loki, retention is governed by the Loki compactor (config under `docker/loki/`) rather than the SQLite cron job described above. `invocation_log` rows are retained permanently, per v3.0 intent. A dedicated SQLite retention cron has not been implemented.
14.4 Event Types
Event types are hierarchical and machine-parseable. The first segment identifies the domain:
- task.*: task.created, task.state_changed, task.dedup_detected, task.overdue, task.escalation_triggered
- api.*: api.call.started, api.call.completed, api.call.failed, api.call.retried, api.circuit_breaker.opened, api.circuit_breaker.closed, api.degraded_mode.activated
- agent.*: agent.dispatched, agent.progress, agent.completed, agent.failed, agent.timeout, agent.interrogation.sent, agent.interrogation.response_received
- scheduler.*: scheduler.weekly_plan, scheduler.daily_recalc, scheduler.slot_assigned, scheduler.conflict_detected, scheduler.calendar_sync.completed, scheduler.calendar_sync.user_modification
- notification.*: notification.sent, notification.failed, notification.escalated, notification.acknowledged, notification.blackout_blocked
- preference.*: preference.correction_logged, preference.rule_extracted, preference.rule_applied, preference.rule_disabled
- system.*: system.startup, system.shutdown, system.health_check, system.backup.completed, system.backup.failed, system.migration.applied
- cost.*: cost.daily_threshold, cost.monthly_warning, cost.agent_paused, cost.budget_increase
- sync.*: sync.supabase.push, sync.supabase.failed, sync.keepalive.sent
14.5 Per-Service Logging Detail
14.5.1 Orchestrator
Logs every task state transition (from/to/trigger/side effects), every routing decision (task type → model alias → resolved provider), every preference rule application, and every cost threshold event. Each incoming message gets a correlation_id that follows the task through parsing, scheduling, agent dispatch, and notification.
14.5.2 FastMCP Server
Logs every tool invocation with: calling agent, tool name, parameters (sanitized --- no credentials ever logged), result summary (truncated to prevent log bloat), and latency. Rate limit hits and authentication failures logged at WARNING level.
14.5.3 Agent Workers
Each agent logs: task received (with context summary), tools used (invocation count, latency per tool), intermediate reasoning steps (DEBUG level only), output summary, total duration, and cost. Agent failures include full error context and the state of work-in-progress so debugging doesn't require reproducing the failure.
14.5.4 Notification Service
Logs every outbound message: channel, recipient identifier (not full content for privacy), delivery status, escalation tier. Failed deliveries trigger automatic retry with separate log entries. Blackout-blocked messages logged at INFO level (not an error, expected behavior).
14.5.5 Scheduler
Logs each scheduling cycle: slots evaluated, conflicts detected, tasks moved, calendar sync results, sync delta (events added/modified/deleted). Performance timing on each cycle to detect scheduling slowdowns as task volume grows.
14.5.6 Discord Bot
Logs: messages received (channel, thread context, is_reply), messages sent (channel, content length), thread creation/closure, connection status changes, reconnection events. Full message content logged at DEBUG level only.
14.6 Log Pipeline
Phase 1 architecture uses a dual-write approach:
- Each service writes structured JSON logs to stdout. Docker captures stdout via the json-file log driver.
- Promtail (deployed in donna-monitoring.yml) tails Docker container logs and ships them to Loki.
- Grafana queries Loki for the real-time dashboard (Section 15).
- Simultaneously, a lightweight log collector module in the orchestrator process writes logs to the SQLite log database for programmatic access, retention management, and correlation analysis.
This dual-write ensures logs are available both in the real-time Grafana dashboard (via Loki, optimized for search and visualization) and in the persistent SQLite store (for long-term queries, automated analysis, and evaluation data).
15. Development Dashboard
The development dashboard is a Phase 1 deliverable deployed via donna-monitoring.yml. It provides real-time visibility into system behavior during development and ongoing operations. Built on Grafana + Loki (both free, open-source, Docker-deployable, minimal resource footprint).
15.1 Dashboard Panels
15.1.1 System Health Overview
- Service status: green/yellow/red indicators for each Docker container (orchestrator, MCP server, Discord bot, notification service).
- Last successful operations: timestamp of last calendar sync, last Supabase sync, last morning digest, last backup.
- NVMe disk usage: total capacity, task DB size, log DB size, backups size, workspace size.
- Memory and CPU: per-container resource usage (via Docker stats).
- Circuit breaker state: open/closed/half-open with timestamp of last state change.
15.1.2 Task Pipeline
- Tasks created today/this week (by channel, by domain).
- State distribution: count of tasks in each state (backlog, scheduled, in_progress, blocked, done, cancelled).
- Average time-to-schedule: from task creation to first scheduled slot.
- Reschedule frequency: tasks rescheduled 3+ times highlighted.
- Dedup hit rate: duplicate detections per day, false positive tracking.
- Completion velocity: tasks completed per day/week, trend line.
15.1.3 LLM & Cost
- API calls per hour/day (by task type, by model alias).
- Token usage: input/output breakdown by task type.
- Cost: daily/weekly/monthly spend, current burn rate, projected monthly total, budget remaining.
- Latency: p50/p95/p99 response times by task type.
- Error rate: failed API calls, retries triggered, circuit breaker activations.
- Shadow mode comparison: side-by-side quality scores when shadow is active (Phase 3+).
15.1.4 Agent Activity
- Active agents: what is currently running, on which task, elapsed time vs timeout limit.
- Completed today/this week: task summaries with cost and duration.
- Failed today: error summaries with expandable stack traces.
- Cost per agent: breakdown showing which agent types consume the most budget.
15.1.5 Notifications
- Messages sent today (by channel, by notification type).
- Delivery failures: count and detail (channel, error reason).
- Escalation events: tasks that required Tier 2+ escalation.
- User response times: how quickly the user acknowledges by channel (feeds notification preference learning).
15.1.6 Error Exploration
- Recent errors: filterable table by service, component, event type, time range, and severity.
- Error timeline: visualization of error frequency over time for spike detection.
- Correlation trace: given a correlation_id, show the full lifecycle of that request across all services --- from input receipt through parsing, scheduling, agent dispatch, and notification.
- Stack trace viewer: expandable error details with full Python traceback.
15.1.7 Preference Learning
- Corrections per week: trend line.
- Rules extracted: new rules created, rules auto-disabled.
- Rule survival rate: percentage of rules still active after 30 days.
15.2 Alerting
Grafana alerting rules (shipped with donna-monitoring.yml configuration):
- Service down: any container unhealthy for > 5 minutes → Discord #donna-debug webhook + SMS via Twilio.
- High error rate: > 10 errors in 5 minutes → Discord #donna-debug webhook.
- Circuit breaker opened → Discord #donna-debug + SMS.
- Budget threshold → Donna's own notification system handles this (Section 13). No separate Grafana alert needed.
- NVMe disk usage > 80% → Discord #donna-debug.
- Supabase sync failure > 1 hour → Discord #donna-debug.
- No orchestrator heartbeat for 10 minutes → External watchdog (Section 18.1.2) handles this.
15.3 Phase 4: Production Dashboard (Flutter)
The Flutter app (Phase 4) provides a user-facing dashboard distinct from the Grafana dev dashboard. It includes: calendar view with Donna-managed events highlighted, task board (kanban-style), agent activity monitor, cost summary, completion heatmap (GitHub contribution graph style), and weekly planning interface. The Flutter app reads from Supabase, not directly from SQLite.
16. Data Architecture
16.1 Database Strategy
- Primary: SQLite on NVMe: `donna_tasks.db` (WAL mode, single database). It hosts task data, corrections, preferences, invocation logs, skill/capability tables, automation tables, chat sessions, and auth tables (see the Alembic migration chain under `alembic/versions/`). v3.0 planned a separate `donna_logs.db`; in practice free-form service logs went to Loki and the invocation log stayed co-located in `donna_tasks.db`.
- Replica: Supabase Postgres. Async write-through sync from SQLite (`integrations/supabase_sync.py`, fire-and-forget task). Free tier with keep-alive HEAD request every 6 hours (changed from the "every 3 days" target in v3.0 to align with the implemented `keep_alive()` method).
- Evaluation: `donna_eval.db` is not implemented; evaluation runs emit their outputs to JSON under `fixtures/` and to the `invocation_log` table with an `eval_session_id` tag.
16.2 Supabase Sync Strategy
The orchestrator pushes task changes to Supabase on write (async, non-blocking). The Flutter app reads from Supabase. If Supabase is down, the system keeps running --- SQLite is the source of truth. When Supabase recovers, a full reconciliation sync runs.
Keep-alive: the system pings the Supabase REST API with a lightweight request (every 6 hours in v3.1; v3.0 targeted a cron job every 3 days with a SELECT 1 query). This prevents the free tier's 7-day inactivity pause and costs nothing.
Supabase free tier includes 500MB database storage, which is sufficient for years of Donna task data. Upgrade to Pro ($25/month, 8GB included storage) when onboarding a second user or when the Flutter app goes live and needs reliable 24/7 access.
16.3 Backup Strategy
16.3.1 Backup Method
Uses SQLite's .backup API (Python connection.backup()) for consistent snapshots. Never file copy --- copying a WAL-mode SQLite database during writes can produce corrupted backups.
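The consistent snapshot is the standard library's `Connection.backup()`; roughly:

```python
import sqlite3

def backup_database(src_path: str, dest_path: str) -> None:
    """Consistent snapshot of a live WAL-mode database; safe during concurrent writes."""
    src = sqlite3.connect(src_path)
    dest = sqlite3.connect(dest_path)
    try:
        src.backup(dest)  # pages copied via SQLite's online backup API, never a file copy
    finally:
        dest.close()
        src.close()
```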
16.3.2 Schedule & Retention
- Daily at 3 AM (blackout hours, minimal activity): full backup of donna_tasks.db and donna_logs.db.
- Retention: 7 daily backups, 4 weekly backups (Sunday), 3 monthly backups (1st of month).
- Worst case storage: ~14 backups × 500MB = 7GB. Trivial on 1TB NVMe.
- Off-server: weekly and monthly backups pushed to cloud storage (Google Cloud Storage free tier 5GB, or Backblaze B2 at $0.005/GB --- ~$0.04/month for expected volume). v3.1 status: not implemented. Backups remain on local NVMe with rotation; off-server push is a Phase 6 item.
16.3.3 Recovery
RPO (Recovery Point Objective): 24 hours maximum data loss (last daily backup). Supabase replica provides secondary recovery path for task data, reducing effective RPO to the sync interval (near-real-time for write-through sync).
Recovery procedure: Stop orchestrator → copy backup to live database path → restart. Orchestrator detects restored DB (version marker) and triggers full Supabase re-sync. Documented in RECOVERY.md in the repo.
Pre-migration backup: Alembic migration runner automatically creates a backup before applying any migration. If migration fails, the backup is the rollback path.
16.4 Data Classification
| Classification | Storage | Access |
|---|---|---|
| Task metadata (titles, schedules, priorities) | SQLite (NVMe) + Supabase (sync) | Orchestrator, all agents, UI |
| Task content (descriptions, notes, prep results) | SQLite primary, Supabase sync | Relevant agents only |
| Credentials (API keys, OAuth tokens) | Linux server only, encrypted (age/sops) | MCP server / integration layer process only |
| Agent outputs (code, drafts, research) | Local filesystem (sandboxed, NVMe workspace) | User + relevant agent |
| Cost/usage logs | SQLite log DB (NVMe) | Orchestrator, dashboard |
| System logs | SQLite log DB + Loki | Dev dashboard (Grafana) |
| Correction log & learned preferences | SQLite task DB (NVMe) | Orchestrator, preference engine |
| Sensitive personal files | Never in assistant-accessible paths | No agent access |
17. Security & Privacy
- Principle of least privilege: each agent only has access to the tools it needs, defined in the task type registry. The Coding Agent cannot read emails. The Drafting Agent cannot modify the calendar.
- No credentials in agent context: agents request tool calls via MCP or the orchestrator. They never see raw API keys or tokens.
- Sandboxed filesystem: agents can only read/write within /donna/workspace/. No access to home directory, system files, or other project folders.
- Git safety: all code changes go to feature branches. Main/production branches have push protection at GitHub level.
- Email safety: Gmail API scoped to read-only + draft by default. Send scope gated behind feature flag (disabled by default). Enabling requires explicit config change + OAuth re-authorization.
- No external data exfiltration: agents cannot send data to arbitrary URLs. The MCP server whitelists allowed outbound destinations.
- Tool validation layer: all model tool call requests are validated by the orchestrator before execution. The model proposes; the orchestrator disposes.
- Blackout enforcement: 12:00 AM -- 6:00 AM hard block on outbound messages, enforced at the notification service level, not the agent level.
- Log sanitization: credentials, tokens, and sensitive data are never written to logs. API request/response bodies are logged only at DEBUG level with sensitive fields redacted.
- NVMe encryption: the dedicated NVMe volume uses LUKS encryption at rest. The decryption key is stored in TPM or entered at boot.
18. Resilience & Failure Handling
18.1 Health Monitoring
18.1.1 Layer 1: Docker Healthchecks
Each Donna service in the compose files gets a healthcheck directive. The orchestrator exposes an HTTP /health endpoint (lightweight aiohttp handler on dedicated port). Docker polls every 30 seconds. Three consecutive failures trigger container restart (restart: unless-stopped).
The /health endpoint checks: SQLite reachable, Discord bot connected, scheduler loop running, last Claude API health-check response < 10 minutes old. Returns 200 if all pass, 503 with a JSON body listing failures.
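A sketch of that handler with aiohttp (the probes are placeholders for the real checks):

```python
from aiohttp import web

# Placeholder probes; production wires these to real checks (SQLite ping, bot state, etc.).
CHECKS = {
    "sqlite": lambda: True,
    "discord": lambda: True,
    "scheduler": lambda: True,
    "claude_api": lambda: True,   # e.g. last health-check response < 10 minutes old
}

async def health(request: web.Request) -> web.Response:
    failures = [name for name, probe in CHECKS.items() if not probe()]
    if failures:
        return web.json_response({"status": "unhealthy", "failures": failures}, status=503)
    return web.json_response({"status": "ok"})

app = web.Application()
app.router.add_get("/health", health)
# web.run_app(app, port=8080)  # Docker's healthcheck directive polls this endpoint
```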
18.1.2 Layer 2: External Watchdog
A separate lightweight process (Python script or bash cron job) runs outside Docker. Every 5 minutes, it checks `docker inspect --format='{{.State.Health.Status}}' donna-orchestrator`. If the container is unhealthy or stopped, it sends an alert via Twilio SMS or a Discord webhook (independent of the Donna bot). This catches the case where Docker itself cannot restart the container (persistent crash loop, port conflict, volume mount failure).
18.1.3 Layer 3: Daily Self-Diagnostic
Part of morning digest generation. Before generating the digest, the orchestrator runs a self-check: DB integrity (PRAGMA integrity_check), NVMe disk space, last successful calendar sync timestamp, last successful Supabase sync timestamp, pending migration check, budget status. Any issues are prepended to the morning digest so the user sees them first thing.
18.2 Acceptable Failures
- Task priority misclassification --- user corrects manually, correction feeds preference learning.
- Duplicate reminders --- minor annoyance, no data loss.
- Agent produces low-quality code --- user reviews before merging; no production impact.
- Scheduling engine places task at suboptimal time --- user reschedules, pattern feeds preference learning.
- Local LLM misroutes a task to Claude API --- costs slightly more but task completes correctly.
18.3 Unacceptable Failures
- Missing a deadline reminder: system must never silently let a hard-deadline task expire without escalating to the user.
- Sending emails to unintended recipients: email sending is architecturally blocked (draft-only default, feature flag for send).
- Deleting files without backup: filesystem operations are append/modify only. Deletes require explicit user action.
- Overwriting code without version control: all code changes are branched and stashed before modification.
- Exceeding budget without notification: cost monitoring runs synchronously with every API call. Budget pauses enforced at orchestrator level.
- Contacting user during blackout (12am--6am): notification service has a hard block on outbound messages.
- Agent running indefinitely: configurable timeout. Timeout triggers user notification and agent_status = failed.
- Learned preference causing repeated errors: auto-disabled and flagged for user review.
- Silent service failure: must be detected within 10 minutes via Docker healthcheck + external watchdog.
19. Testing Strategy
Four testing layers ensure correctness and catch regressions at appropriate cost levels.
19.1 Layer 1: Unit Tests (Core Logic)
Framework: pytest. Target: 90%+ coverage on scheduler time-slot allocation, state transition validation, preference rule matching, and dedup scoring. These are pure functions (input → output, no external dependencies). Tests run in < 1 second and catch regressions immediately.
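An illustrative pair of unit tests in that style; the `transition()` helper and state names here are stand-ins for the config-driven state machine, not its real API:

```python
import pytest

# Minimal stand-in for the task_states.yaml-driven transition table.
VALID = {("backlog", "schedule"): "scheduled", ("scheduled", "complete"): "done"}

class InvalidTransition(Exception):
    pass

def transition(state: str, action: str) -> str:
    try:
        return VALID[(state, action)]
    except KeyError:
        raise InvalidTransition(f"{action!r} not allowed from {state!r}")

def test_backlog_to_scheduled_is_valid():
    # Pure-function check: a legal transition returns the new state.
    assert transition("backlog", "schedule") == "scheduled"

def test_done_cannot_be_rescheduled():
    # Illegal transitions raise rather than silently corrupting state.
    with pytest.raises(InvalidTransition):
        transition("done", "schedule")
```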
19.2 Layer 2: Integration Tests (Service Boundaries)
Test the orchestrator's interaction with SQLite, the MCP server's tool execution, and the notification service's channel dispatch. Use real SQLite (in-memory for speed) and mock external APIs (Google Calendar, Discord, Twilio). Framework: pytest + pytest-asyncio + aioresponses (for mocking HTTP). Target: every integration module has at least one test verifying the request validation → execution → response cycle.
19.3 Layer 3: LLM Output Evaluation
The offline evaluation harness (Section 4.5) covers LLM output quality. A small "smoke test" subset (3--5 Tier 1 fixtures per task type) runs as part of CI to verify the parsing pipeline still works after code changes. Budget: ~$0.05 per CI run. Full evaluation runs are triggered manually for model comparison.
19.4 Layer 4: End-to-End Scenario Tests
Simulate full user workflows: send a Discord message → task created in SQLite → scheduled on Google Calendar → reminder sent at scheduled time → user marks complete. These hit real APIs (test Google Calendar, test Discord channel) and are slow/expensive. Run weekly or before releases, not on every commit.
19.5 Test Data Management
Maintain a tests/fixtures/ directory with sample tasks, calendar states, user corrections, and expected outputs. Version-controlled alongside the code. Same fixtures used by unit tests and evaluation harness.
20. Implementation Phases
The project was originally divided into four phases. As of April 2026, Phases 1--5 are complete (Phase 5 overlapped with Phase 4 and expanded automations and the skill lifecycle). The list below is preserved as the historical plan; current status is summarized in the table at the end of the section, and the Phase 6 roadmap replaces the v3.0 Phase 4+ description.
Phase 1: Foundation (Weeks 1--4) --- COMPLETE
Goal: Task capture, basic scheduling, reminders, observability. Claude API only. Solve the day-one problem.
- Set up Linux server Docker stack: donna-core.yml with orchestrator, integration layer, notification service.
- Deploy donna-monitoring.yml: Grafana + Loki + Promtail for dev dashboard.
- Build orchestrator service (Python asyncio) with SQLite task DB on NVMe including user_id field.
- Implement structlog across all services; set up dedicated logging database (donna_logs.db).
- Implement model abstraction layer with AnthropicProvider (OllamaProvider stubbed for later).
- Implement structured invocation logging on every model call.
- Implement API resilience layer: retries, degraded modes, circuit breaker, response validation.
- Implement task lifecycle state machine (config-driven, loaded from task_states.yaml).
- Define initial task types in task_types.yaml: parse_task, classify_priority, generate_digest.
- Implement input parser via Claude API (natural language → task schema).
- Implement task deduplication (two-pass: fuzzy + LLM semantic comparison).
- Set up first input channel: Discord bot with dedicated category, channels, and thread-based context.
- Integrate Google Calendar API (read-write on personal calendar, read on work/family).
- Implement calendar sync strategy: polling, change detection, Donna-managed event tagging.
- Build basic scheduling engine: auto-schedule in available slots, respect time blocks.
- Implement reminder system: Discord notifications at scheduled times.
- Implement overdue detection and nudge messages.
- Morning digest via Discord (generated by Claude API in Donna persona).
- Deploy Donna persona system prompt across all communications.
- Configure Docker healthchecks for all services, external watchdog, daily self-diagnostic.
- Configure Grafana dashboards: System Health, LLM & Cost, Task Pipeline, Error Exploration.
- Set up SQLite backup automation (daily at 3 AM, retention rotation).
- Set up Supabase project (free tier + keep-alive cron). Background write-through sync.
- Set up Alembic for schema migration. Initial migration creates all tables.
- Begin building tiered evaluation test fixtures (Tier 1 baseline through Tier 4 adversarial, version-controlled).
- Implement unit tests for state machine, scheduler, preference matching, dedup logic.
- Spot-check monitoring disabled (Claude evaluating itself is not useful).
Phase 1 Deliverable: User texts Donna on Discord to create tasks, gets a morning digest, receives reminders, and gets nudged about overdue items. Tasks auto-schedule around calendar events. All model calls logged with cost tracking. Full observability via Grafana dashboard. Backup running. Health monitoring active. Calendar sync operational.
Phase 2: Intelligence & Communication (Weeks 5--7) --- COMPLETE
Goal: Smarter scheduling, multi-channel communication, prep work, correction logging.
- Add SMS input/output via Twilio. Implement conversation context store for SMS multi-turn interactions.
- Add email monitoring (Gmail API read-only) for forwarded tasks and calendar invites.
- Implement dynamic priority escalation algorithm.
- Implement task dependency chains and auto-rescheduling.
- Build notification escalation tiers (app → SMS → email → phone).
- Implement prep work system: flag tasks, define instructions, Research Agent executes via Claude API.
- Build FastMCP server (Python) with initial MCP tools: web search, notes, filesystem read.
- Implement cost tracking dashboard panels in Grafana.
- Implement correction logging: every user override is recorded.
- End-of-day digest email.
- Externalize prompt templates as files; begin few-shot example accumulation from corrections.
- Integration tests for all service boundaries.
Phase 2 Deliverable: Multi-channel communication. Prep work runs before tasks. Costs tracked. Scheduling has priority escalation and dependencies. Correction data accumulating for preference learning.
Phase 3: Sub-Agents, Local LLM & Preferences (Weeks 8--11) --- COMPLETE
Goal: Autonomous task execution. Local LLM deployment (requires RTX 3090). Preference learning. Multi-user data model active.
- Deploy donna-ollama.yml with RTX 3090; implement OllamaProvider behind model interface.
- Run evaluation test fixtures against local model; validate parsing accuracy.
- Run offline evaluation harness sequentially against candidate local models; compare across quantization levels and parameter sizes.
- Build and run escalation awareness fixtures: validate model knows when to hand off vs handle.
- Build and run instruction following fixtures: validate model can execute Claude-generated directives.
- Enable shadow mode with Claude as secondary model for production monitoring on migrated task types.
- Enable spot-check quality monitoring with initial rate of 0.10--0.20 for fast signal.
- Begin local LLM tool use Stage 1 (read-only tools: task_db_read, calendar_read).
- Implement rule extraction from correction log (weekly Claude API batch job).
- Build preference engine: apply learned rules as post-processing on model output.
- Implement preference transparency UI: view, edit, disable, delete learned rules.
- Build agent worker pool with sandboxed execution environments.
- Implement PM Agent: task decomposition, requirements interrogation, work packaging.
- Implement Coding Agent: sandboxed code generation with git integration.
- Implement Communication/Drafting Agent: email drafts, document creation.
- Build agent activity log and monitoring system.
- Implement budget controls: daily threshold, task cost approval, monthly ceiling.
- Expand FastMCP server with GitHub tools, additional MCP endpoints.
- Enable multi-user data paths (user_id scoping on all queries, per-user credentials in integration layer).
Phase 3 Deliverable: Sub-agents receive tasks, interrogate for requirements, and produce outputs. Local LLM handles validated task types with shadow monitoring and spot-check quality audits. Evaluation harness enables model comparison. Preferences learned from corrections. Multi-user infrastructure ready.
Phase 4: UI, Multi-User & Polish (Weeks 12+) --- PARTIALLY COMPLETE
Goal: Full dashboard, mobile app, second user onboarding, optimization.
v3.1 status: Backend multi-user (FastAPI + Supabase), Grafana dashboards, admin REST API, push-notification plumbing, and proactive capture prompts are all live. The Flutter Web + Android client is scaffolded under `donna-app/` but is not the daily-driver UI yet. Local LLM Stage 2--3 tool use, second-user onboarding, and the completion heatmap remain open.
- Build Flutter Web + Android app with chat interface and production dashboard.
- Upgrade Supabase to Pro plan ($25/month) for reliable multi-user access.
- Implement calendar view, task board (kanban), agent monitor, cost dashboard.
- Implement completion heatmap (GitHub contribution graph style).
- Push notifications (FCM) for Android.
- Weekly planning session feature (Monday morning interactive scheduling).
- Proactive task capture prompts (post-meeting, evening check-in, stale task detection).
- Local LLM tool use Stage 2--3 (conditional tool use, write tools with guardrails).
- Onboard second user (dad): per-user preferences, calendar, notifications, persona config.
- Cost optimization analysis: migrate validated task types from Claude to local based on evaluation data.
- Dial back spot-check rate to 0.02--0.05 as confidence in local model quality stabilizes.
- End-to-end scenario tests. Performance tuning and reliability hardening.
Phase 4 Deliverable: Full-featured application with visual dashboard, mobile access, refined autonomous workflows, multi-user support, and optimized cost routing between local and cloud models.
Phase 5: Skills, Capabilities & Automations --- COMPLETE
Goal (added post-v3.0): make Donna's behavior extensible at runtime through user-defined skills and scheduled automations, with a shadow/evolution loop that improves skill quality without human intervention.
- Capability registry with embedding-based duplicate detection (§23.2).
- Skill DSL, loader, and executor with per-step tool allowlists (§23.3).
- Skill lifecycle (sandbox → shadow → trusted, with divergence, equivalence, and degradation gates).
- Claude-driven auto-drafter for new skills and evolution loop for improving failing skills (§23.4).
- Automations subsystem: cron-triggered skill invocations with per-run budget, alert conditions, and cadence policies (§25).
- LLM Gateway: cost-aware priority queue with preemption, rate limiting, and budget alerts across all outbound LLM calls (§26).
- Admin REST API: dashboards, log search, skill pipeline inspection, task/preference/agent CRUD, Supabase sync visibility (§27).
- Authentication & access control: IP gate, device tokens, Immich SSO, email allowlist + verification (§28).
- Conversation Engine: stateful multi-turn Discord chat with intent classification, token budgeting, and session TTL (§24).
Phase 5 Deliverable: Donna can learn new capabilities during normal use, promote verified skills from shadow to production, and run recurring automations end-to-end. Operators have a complete admin surface for inspection and control.
Phase 6: MCP, Remaining Agents, & Production UI --- NEXT
Goal: close the remaining gaps between the v3.0 design and the running system.
- Build the FastMCP server and wire GitHub, Filesystem (sandboxed `/donna/workspace/`), Notes, and Web Search (SearXNG or API) (§3.2.3--3.2.4).
- Enable Coding Agent and Communication / Drafting Agent with Stage 3 write tools under guardrails (§7.1.1, §8.3).
- Dependency-chain priority escalation and the `priority_locked` flag (§5.5.2).
- Full conflict-resolution matrix (§6.1.2), including dual-invite disambiguation and cascade-shifting.
- Extended Work and Emergency Work time windows (§6.2).
- Off-server backup push to GCS/Backblaze (§16.3.2).
- Flutter production UI as daily driver.
- Onboard second user (dad): per-user preferences, persona config, calendar, notifications.
Phase Status Summary (April 2026)
| Phase | Scope | Status |
|-------|-------|--------|
| 1 | Foundation (task capture, scheduling, reminders, observability, budget) | Complete |
| 2 | Multi-channel (SMS, email), prep work, escalation, dedup, corrections | Complete |
| 3 | Sub-agents (Challenger, Novelty Judge, PM, Prep, Scheduler), local LLM, preference learning, shadow mode | Complete |
| 4 | FastAPI backend, Grafana dashboards, admin API, proactive capture, push | Partial: Flutter client and multi-user onboarding open |
| 5 | Skills, Capabilities, Automations, LLM Gateway, Admin UI, Auth, Chat engine | Complete |
| 6 | MCP server, Coding/Communication agents, Phase-6 backlog (above) | Next |
21. Success Metrics
| Metric | Target (3 months) | Measurement |
|--------|-------------------|-------------|
| Task capture rate | 90%+ of action items recorded | Self-reported weekly assessment |
| Schedule adherence | 70%+ of tasks completed in the scheduled time slot | Automated: completed_at vs scheduled_start |
| Reminder effectiveness | 80% response within 30 min | Automated: reminder-sent-to-acknowledgment time |
| Agent task completion | 5+ tasks/week delegated and completed | Automated: agent_status = complete count |
| Budget efficiency | Under $100/month with agents active | Automated: cost tracking dashboard |
| Tasks completed per week | 25+ with consistency | Automated: completion heatmap data |
| Preference learning accuracy | 80%+ of extracted rules remain active after 30 days | Automated: rule disable/delete rate |
| Local LLM migration rate | 50%+ of parsing/classification on local model | Automated: invocation_log model_actual analysis + eval harness |
| Escalation awareness | 85%+ precision and recall on escalation decisions | Automated: escalation_awareness fixture results in model sessions |
| Instruction following | 90%+ constraint compliance when executing Claude directives | Automated: instruction_following fixture results in model sessions |
| Zero unacceptable failures | No missed reminders, no unintended emails, no data loss | Automated: failure log monitoring + alerting |
| Mean time to diagnose | < 5 minutes for typical production issues | Manual: Grafana dashboard + correlation trace |
22. Technology Stack Summary
| Layer | Technology | Notes |
|-------|------------|-------|
| Orchestrator | Python (asyncio) | Core service: routing, scheduling, state management, preference engine |
| Cloud LLM | Claude API | Primary LLM for all phases. Sonnet (claude-sonnet-4-20250514) for cost efficiency; Opus for critical tasks. |
| Local LLM | Ollama + Llama 3.1 8B (Q4_K_M) on RTX 3090 | Deferred until 3090 acquired. Dedicated GPU, no sharing. |
| Model Interface | Python (AnthropicProvider, OllamaProvider) | Standardized complete() interface with structured logging |
| Agent Framework | Python + Claude API tool use | Each agent is a Python process with defined tool access |
| Integration Layer | Python (internal API modules) | Direct calls for orchestrator. Centralized auth, audit logging. |
| MCP Server | Python (FastMCP 3.x, Streamable HTTP) | LLM-facing tools only. CodeMode for token efficiency. External client endpoint. Deferred to Phase 6. Production LLM tools go through the internal Skills tool registry (§23.3). |
| Task Database | SQLite on NVMe (primary, donna_tasks.db) | WAL mode. Sub-ms reads. user_id on all tables. Also hosts invocation_log, skill/capability, automation, chat, and auth tables (see §16.1). |
| Log Storage | Loki (+ invocation_log in SQLite) | Service logs over Loki; per-LLM records in SQLite. No dedicated donna_logs.db (see §14.3). |
| Cloud Replica | Supabase (Postgres) | Free tier + keep-alive → Pro at Phase 4. Write-through sync. |
| Observability | Grafana + Loki + Promtail (Docker) | Phase 1 deliverable. donna-monitoring.yml. |
| Structured Logging | structlog (JSON, contextvars) | Async-safe context propagation. Correlation IDs. |
| Web/Mobile App | Flutter (Web + Android) | Single codebase. Firebase Hosting. FCM push. Phase 4. |
| Backend API | Python FastAPI | REST API between Flutter app and orchestrator |
| Notifications | Twilio (SMS/Voice), Gmail API, FCM, discord.py | Multi-channel with escalation tiers |
| Deployment | Docker Compose (multi-file homelab pattern) | donna-core, donna-monitoring, donna-ollama, donna-app |
| Server OS | Ubuntu Linux (always-on home server) | i7-6700K, 32GB. GTX 1080 (Immich). RTX 3090 (Donna, TBA). |
| Storage | 1TB NVMe dedicated to Donna | DB, logs, workspace, backups, config, fixtures, model cache |
| Schema Migration | Alembic (SQLAlchemy) | Version-controlled migrations for SQLite + Supabase |
| Testing | pytest, pytest-asyncio, aioresponses, eval harness | 4 layers: unit, integration, LLM eval, E2E |
| Version Control | GitHub | All code, agent outputs on feature branches |
| Secrets Management | age/sops or environment variables | Never in code, never in agent context |
| Configuration | YAML files (models, routing, task types, states, preferences) | Config over code for all extensible behavior |
23. Skills & Capabilities System (added v3.1)
The skills + capabilities system is the extensibility layer that lets Donna learn new behaviors at runtime without requiring a code deploy. It was built during Phases 3--5 and expands well beyond the "sub-agent" hierarchy in §7.
23.1 Definitions
- A Capability is a user-facing pattern Donna can handle ("product_watch", "news_check", "email_triage", "fetch_and_summarize"). It declares a name, description, input schema, trigger type (`on_message`, `on_schedule`, `on_manual`), and an optional `default_output_shape`. Capabilities carry a semantic embedding to detect duplicates against user intent.
- A Skill is a concrete implementation of one capability, authored in a small YAML DSL (multi-step, `for_each`, `retry`, `on_failure`, per-step tool allowlists). Each capability may have multiple skill versions; exactly one is "primary" at a time.
Tables: capability, skill, skill_version,
skill_state_transition, skill_run, skill_step_result,
skill_fixture, skill_divergence, skill_candidate_report,
skill_evolution_log, correction_cluster.
Configuration: config/capabilities.yaml seeds initial capabilities;
config/skills.yaml controls lifecycle tunables (shadow sample
rate, promotion thresholds, degradation windows, auto-draft caps,
evolution gates, correction-clustering parameters).
23.2 Capability Registry & Matching
The CapabilityRegistry stores capabilities with embeddings
(capability.embedding BLOB). CapabilityMatcher performs cosine
similarity against the incoming task's vector and returns the
best match above a configurable threshold. When a match is found,
the orchestrator writes task.capability_name and
task.inputs_json (§5.1) and routes execution to the Skills
system instead of a generic agent. A ToolRequirements check
confirms the capability's declared tools are available before
dispatch.
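The matching step reduces to a cosine-similarity argmax with a floor. A minimal sketch, assuming MiniLM-style embeddings; the threshold value and helper signatures here are illustrative, not the shipped `CapabilityMatcher` API:

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def best_match(task_vec: np.ndarray,
               capabilities: list[tuple[str, np.ndarray]],
               threshold: float = 0.78) -> str | None:
    """Return the capability name most similar to the task embedding,
    or None if nothing clears the (configurable) threshold."""
    if not capabilities:
        return None
    name, score = max(((n, cosine(task_vec, emb)) for n, emb in capabilities),
                      key=lambda pair: pair[1])
    return name if score >= threshold else None
```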
23.3 Skill Executor and Tool Registry
The SkillExecutor resolves the primary version of a capability's
skill, renders the step graph, dispatches tool calls through the
internal Skills Tool Registry, and persists each step's input,
output, latency, and cost into skill_run / skill_step_result.
The tool registry enforces per-step allowlists (raises
ToolNotAllowedError on violation) and is the production
equivalent of the Phase 6 FastMCP server (§3.2.3). Current
registered tools include calendar_read, task_db_read, and a
small web_fetch; write tools (task_db_write, calendar_write)
are declared in configs but not yet implemented in the registry and
are gated on the Stage 3 tool-use work (§8.3).
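A sketch of the allowlist enforcement the executor relies on; `ToolNotAllowedError` is named above, while the dict-based registry shape and `call()` signature are assumptions:

```python
from typing import Any, Awaitable, Callable

class ToolNotAllowedError(Exception):
    pass

class ToolRegistry:
    def __init__(self) -> None:
        self._tools: dict[str, Callable[..., Awaitable[Any]]] = {}

    def register(self, name: str, fn: Callable[..., Awaitable[Any]]) -> None:
        self._tools[name] = fn

    async def call(self, name: str, allowlist: set[str], **kwargs: Any) -> Any:
        # The executor passes each step's declared allowlist on every call,
        # so a skill step can never reach a tool it did not declare.
        if name not in allowlist:
            raise ToolNotAllowedError(f"step may not call {name!r}")
        return await self._tools[name](**kwargs)
```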
23.4 Lifecycle: sandbox → shadow → trusted → degraded
Skills move through a gated lifecycle:
- Sandbox -- fixture-only; no production traffic.
- Shadow -- Claude-equivalence sampling against live inputs; outputs are compared via the equivalence judge, and divergences are logged to `skill_divergence`. Promotion gates in `config/skills.yaml` require a minimum shadow sample count, equivalence rate, and a fixture-regression pass (a gate sketch follows the module list below).
- Trusted (primary) -- the skill is the default executor for its capability; Claude calls are reserved for shadow sampling and for when the skill emits `needs_claude`.
- Flagged / Degraded -- triggered by sustained divergence, a cluster of corrections, or a sharp failure-rate change. Donna falls back to a Claude-native implementation until the skill is fixed or a new draft supersedes it.
Supporting modules: skills/shadow.py, skills/divergence.py,
skills/equivalence.py, skills/lifecycle.py,
skills/evolution_gates.py, skills/degradation.py.
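The promotion decision reduces to three boolean gates. A sketch with illustrative thresholds; the real values live in `config/skills.yaml` and the field names here are assumptions:

```python
from dataclasses import dataclass

@dataclass
class ShadowStats:
    samples: int            # live inputs sampled while in shadow
    equivalent: int         # outputs the equivalence judge accepted
    fixtures_passed: bool   # fixture-regression suite result

def may_promote(stats: ShadowStats,
                min_samples: int = 50,
                min_equivalence: float = 0.95) -> bool:
    """All three gates must pass before a skill becomes primary."""
    if stats.samples < min_samples:
        return False
    if stats.equivalent / stats.samples < min_equivalence:
        return False
    return stats.fixtures_passed
```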
23.5 Auto-Drafting and Evolution
- Auto-drafter (`skills/auto_drafter.py`) -- when the Novelty Judge (§7.1.1) flags a task as a reusable capability, Claude drafts a YAML skill + fixtures + output schema, which lands as a candidate via the Admin API (`skill_candidates`, `skill_drafts`). A human still promotes the candidate into the shadow stage.
- Evolution (`skills/evolution.py`, `skills/evolution_scheduler.py`) -- a nightly job clusters recent corrections by skill (`correction_cluster`), asks Claude to propose a patched skill version, runs it through the evolution gates, and --- if it passes --- submits it as a new `skill_version` ready for shadow evaluation.
23.6 Seed Content
The top-level skills/ and capabilities/ directories carry the
initial bundle: skill definitions (classify_priority,
dedup_check, email_triage, fetch_and_summarize, news_check,
parse_task, product_watch) and capability manifests
(email_triage, news_check). These are loaded on startup by
skills/seed_capabilities.py if the DB is empty.
24. Chat / Conversation Engine (added v3.1)
v3.0's input-parsing pipeline (§10.4) handles single-turn task
capture. Multi-turn conversational use is provided by the
Conversation Engine (src/donna/chat/) which underpins Discord
chat, intent classification, and clarification flows.
Key properties:
- Stateful sessions (`chat_session`, `chat_message` tables): each session has a user, channel, TTL, and a running summary produced when the session closes.
- Intent classification via the `classify_chat_intent` task type (local LLM), routing each message into one of `{task, automation, question, chat}` (see the routing sketch after this list). Tasks feed the normal pipeline; automations create or modify entries in the automation repository (§25); questions use a read-only Q&A prompt; `chat` stays in the engine for conversational turns.
- Context budget per session (default 24,000 tokens), with summarization when the limit is approached. Configured in `config/chat.yaml`.
- Escalation budget (dollar cap per day in `chat.yaml`) ensures a long-running Discord thread cannot blow the daily Claude budget.
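A sketch of the post-classification dispatch; the four intents are from this section, while the handler names are placeholders for the real pipelines:

```python
from enum import Enum

class Intent(str, Enum):
    TASK = "task"
    AUTOMATION = "automation"
    QUESTION = "question"
    CHAT = "chat"

# Placeholder handlers for the task pipeline (§10.4), the automation
# repository (§25), the read-only Q&A prompt, and the chat loop.
async def capture_task(message: str) -> None: ...
async def upsert_automation(message: str) -> None: ...
async def answer_readonly(message: str) -> None: ...
async def continue_chat(session_id: str, message: str) -> None: ...

async def route_message(intent: Intent, message: str, session_id: str) -> None:
    if intent is Intent.TASK:
        await capture_task(message)
    elif intent is Intent.AUTOMATION:
        await upsert_automation(message)
    elif intent is Intent.QUESTION:
        await answer_readonly(message)
    else:
        await continue_chat(session_id, message)
```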
25. Automations Subsystem (added v3.1)
Automations are user-defined recurring invocations of a capability
(most commonly product_watch, news_check, or a user-authored
skill). They live in src/donna/automations/ and the
automation / automation_run tables.
- Trigger model. Cron-style schedules via `croniter`, plus manual "run now" triggers.
- Dispatcher (`automations/dispatcher.py`) picks between skill-based execution and `claude_native` fallback based on the capability's current skill state. A per-run budget is applied from `config/automations.yaml`, and failures beyond a streak threshold pause the automation (`automation.paused_until`).
- Cadence policy (`automations/cadence_policy.py`) maps skill state to a minimum interval so sandbox/shadow skills do not run as often as trusted ones (sketched after this list).
- Alert evaluator (`automations/alert.py`) checks automation-specific conditions (e.g. "product price dropped below threshold") and emits a notification via the standard escalation ladder (§11).
- State blob: `automation.state_blob` persists per-automation memory between runs (last seen item, last alert timestamp, etc.).
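A sketch of the cadence mapping; the skill states are from §23.4, but the interval values are illustrative rather than the shipped `config/automations.yaml` numbers:

```python
from datetime import timedelta

# Unproven skills run less often than trusted ones (values illustrative).
MIN_INTERVAL = {
    "sandbox": timedelta(hours=24),
    "shadow": timedelta(hours=6),
    "trusted": timedelta(minutes=15),
    "degraded": timedelta(hours=24),  # throttled until fixed or superseded
}

def cadence_allows(skill_state: str, last_run_age: timedelta) -> bool:
    """True if enough time has passed for this skill state's cadence."""
    return last_run_age >= MIN_INTERVAL.get(skill_state, timedelta(hours=24))
```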
26. LLM Gateway & Queue (added v3.1)
All outbound calls to Claude and Ollama flow through the LLM
Gateway (src/donna/llm/). Purpose: keep a single, cost-aware,
preemptible choke point so budget enforcement and observability
never get bypassed.
- Priority queue (`llm/queue.py`) with two lanes (a minimal sketch appears at the end of this section):
  - Internal -- system-initiated calls (digests, reminders, evolution jobs, shadow sampling).
  - External -- user-initiated calls (chat, task capture, interrogation).

  External calls can preempt internal ones; `invocation_log` records `queue_wait_ms`, `interrupted`, and the preempting caller in `chain_id` / `caller`.
- Rate limiter (`llm/rate_limiter.py`) caps per-minute and per-hour outbound rates per provider; respects `donna_models.yaml` burst settings.
- Alerter (`llm/alerter.py`) fires notifications on budget thresholds (daily $20 pause, 80%/90% monthly), circuit-breaker opens, and sustained queue backlogs.
- Cost enforcement. Pre-call, the gateway consults `CostTracker` / `BudgetGuard` (§13, `src/donna/cost/`) and may raise `BudgetPausedError`. Post-call, it writes the full metadata row into `invocation_log`.
The gateway is what makes the spec's "budget rules" (§13.1) enforceable in practice.
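A minimal sketch of the two-lane discipline. Real preemption also interrupts in-flight internal calls and records `interrupted`; here the external lane simply always dequeues first, and all names are illustrative rather than the shipped `llm/queue.py` API:

```python
import asyncio
import time
from dataclasses import dataclass, field
from typing import Awaitable, Callable

EXTERNAL, INTERNAL = 0, 1  # lower value dequeues first

@dataclass(order=True)
class LlmJob:
    lane: int                                # only the lane is compared
    enqueued_at: float = field(compare=False)
    prompt: str = field(compare=False)

queue: asyncio.PriorityQueue[LlmJob] = asyncio.PriorityQueue()

async def submit(prompt: str, user_initiated: bool) -> None:
    lane = EXTERNAL if user_initiated else INTERNAL
    await queue.put(LlmJob(lane, time.monotonic(), prompt))

async def worker(complete: Callable[[str], Awaitable[None]]) -> None:
    while True:
        job = await queue.get()
        queue_wait_ms = (time.monotonic() - job.enqueued_at) * 1000
        await complete(job.prompt)  # gateway would log queue_wait_ms here
```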
27. Admin API & Dashboard (added v3.1)
The Admin API (FastAPI under src/donna/api/) is the
operator's surface for inspection and control. It runs on port 8200
and is protected by the authentication layer (§28).
Routes (src/donna/api/routes/):
- `admin_dashboard.py` -- task pipeline, completion velocity, budget, skill-system status cards, quality warnings. Thresholds come from `config/dashboard.yaml`.
- `admin_logs.py` -- structured log search across Loki + `invocation_log` with hierarchical `event_type` filtering (adds `llm_gateway`, `ui`, and `admin` branches to the v3.0 taxonomy in §14.4).
- `admin_invocations.py` -- per-call analytics; drill-down into queue wait, chain, caller, `skill_id`.
- `admin_agents.py` -- enable/disable/inspect each agent.
- `admin_tasks.py`, `admin_preferences.py`, `admin_access.py`, `admin_health.py`, `admin_config.py`, `admin_shadow.py` -- CRUD and live inspection for their respective domains.
- `skill_candidates.py`, `skill_drafts.py` -- the promotion pipeline for candidate skills produced by the auto-drafter (§23.5).
- `auth_flow.py`, `llm.py` -- device registration and direct LLM-gateway controls.
Grafana dashboards (docker/grafana/dashboards/) consume the same
underlying data via Loki and the admin SQLite datasource; the admin
API exposes the same information to the Flutter client and CLI. The
Notification Dashboard panel (§15.1.5) and Preference Learning
panel (§15.1.7) listed in v3.0 are not yet built and are Phase
6 items.
28. Authentication & Access Control (added v3.1)
v3.0's §17 described security principles at a high level. v3.1
documents the production auth stack (src/donna/api/auth/).
- IP gate (`ip_gate.py`) -- per-IP action log and rate limit. Actions: `allow`, `challenge`, `deny`. Trust durations (24h, 7d, 30d, 90d) are tiers in `config/auth.yaml`.
- Device tokens (`device_tokens.py`) -- sliding + absolute window tokens used by the Flutter client and CLI (sketched below). Registration lives in `auth_flow.py`.
- Immich SSO (`immich.py`) -- federates identity with the homelab's Immich service; 60-second user cache, allowlist sync every 15 minutes.
- Email allowlist + verification tokens (`email_allowlist.py`, `verification_tokens.py`, `email_sender.py`) -- self-service onboarding for new email addresses, with verification URL + expiry.
- Service keys (`service_keys.py`) -- shared secret for inter-service calls (e.g. the Ollama container → orchestrator).
- Trusted proxies (`trusted_proxies.py`) -- safe `X-Forwarded-For` handling for the Caddy edge in `docker/caddy/`.
- Dependencies & router factory (`dependencies.py`, `router_factory.py`) -- FastAPI composition so individual admin routes declare their auth requirements (Immich login, service-key-only, device-token + role, etc.).
Table users maps donna_user_id ↔ immich_user_id ↔ role; the
ip_access_log table stores access-attempt history.
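The sliding + absolute rule for device tokens reduces to two clocks that must both hold. A sketch with illustrative window lengths (the shipped values live in `config/auth.yaml`):

```python
from datetime import datetime, timedelta, timezone

SLIDING = timedelta(days=14)    # refreshed on every authenticated call
ABSOLUTE = timedelta(days=90)   # hard expiry, never extended

def token_valid(issued_at: datetime, last_seen: datetime,
                now: datetime | None = None) -> bool:
    """Valid only while recently used (sliding) AND not past expiry."""
    now = now or datetime.now(timezone.utc)
    return (now - last_seen) <= SLIDING and (now - issued_at) <= ABSOLUTE
```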
29. Setup Wizard (added v3.1)
src/donna/setup/ provides an interactive wizard
(wizard.py, validators.py, phases.py) that bootstraps a fresh
deployment: checks environment (GPU, storage, Docker), validates
required env vars, walks the operator through Discord/Twilio/Gmail
OAuth, and runs the initial Alembic migration. It's orthogonal to
runtime behavior and exists mainly to make re-provisioning
reproducible.
30. Memory & Vault Subsystem (added v3.1 — slices 12 + 13 + 14 + 15)
Donna owns a durable, human-editable markdown workspace (slice 12) and
a semantic index over that workspace (slice 13). Together they give
agents a read/write surface for meeting notes, people profiles, daily
logs, and research artefacts — plus retrieval by meaning, not path —
without adding a second database daemon. The authoritative design
lives at docs/reference-specs/memory-vault-spec.md; this section is
the spec_v3 anchor.
30.1 Vault Plumbing (slice 12)
An Obsidian-compatible vault rooted at vault.root (config/memory.yaml).
VaultClient (src/donna/integrations/vault.py) provides read-only
read / list / stat / extract_links; VaultWriter is the sole
mutation path, enforcing the safety envelope from §7.3 (path
containment, .md extension, safety.path_allowlist,
safety.max_note_bytes, optimistic concurrency via expected_mtime,
frontmatter preservation). Every mutation produces exactly one git
commit via GitRepo (subprocess, no GitPython). WebDAV sync is
provided by a Caddy container so Obsidian desktop and mobile clients
can round-trip writes. Agents reach the vault via the
vault_{read,write,list,link,undo_last} tool family. Full design:
docs/reference-specs/memory-vault-spec.md §§1-6.
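A sketch of the write-side safety envelope; the helper name and error type are illustrative, and the real checks live inside `VaultWriter`:

```python
from pathlib import Path

class VaultWriteError(Exception):
    pass

def check_write(root: Path, rel_path: str, body: bytes,
                allowlist: tuple[str, ...], max_bytes: int,
                expected_mtime: float | None) -> Path:
    target = (root / rel_path).resolve()
    if root.resolve() not in target.parents:
        raise VaultWriteError("path escapes vault root")          # containment
    if target.suffix != ".md":
        raise VaultWriteError("only .md files may be written")
    rel = str(target.relative_to(root.resolve()))
    if not any(rel.startswith(prefix) for prefix in allowlist):
        raise VaultWriteError("path not in safety.path_allowlist")
    if len(body) > max_bytes:
        raise VaultWriteError("exceeds safety.max_note_bytes")
    if expected_mtime is not None and target.exists() \
            and target.stat().st_mtime != expected_mtime:
        raise VaultWriteError("concurrent edit detected (mtime mismatch)")
    return target
```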
30.2 Semantic Memory (slice 13)
The memory layer adds three tables inside donna_tasks.db (§16.1):
- `memory_documents` — one row per ingested source, keyed by `(user_id, source_type, source_id)`. Soft-deleted via `deleted_at` so the ANN index does not need tombstone sweeps.
- `memory_chunks` — one row per chunk, carrying `content`, `token_count`, JSON-encoded `heading_path` provenance, and the `embedding_version` tag used at ingest time.
- `vec_memory_chunks` — a sqlite-vec `vec0` virtual table with `chunk_id TEXT PRIMARY KEY, embedding FLOAT[384]`. Loaded on the shared aiosqlite connection in `Database.connect()`; a missing extension degrades to `vec_available=False` without blocking boot (memory tools simply don't register).
`MemoryStore` exposes `put` / `upsert` / `upsert_many` / `delete` / `reindex` / `search` (design target in docs/reference-specs/memory-vault-spec.md §8). Upsert hashes the document body; unchanged hashes short-circuit without invoking the embedding provider, so the `invocation_log` row count is a dedup signal. `search` runs a single three-table join (sketched below), ordered by `vec0` distance, with score `1 - distance² / 2` on MiniLM's unit-normalized vectors; for unit vectors the squared L2 distance is `2 - 2·cosine`, so this score is exactly the cosine similarity. Results below `retrieval.min_score` are dropped, and `k` is clamped to `retrieval.max_k`.
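A sketch of that three-table join, assuming sqlite-vec's default L2 metric; the join-key column names (`chunk_id`, `document_id`) are inferred from the table list above, not confirmed schema:

```python
# KNN first, then hydrate provenance and filter soft-deletes in the join.
SEARCH_SQL = """
WITH knn AS (
    SELECT chunk_id, distance
    FROM vec_memory_chunks
    WHERE embedding MATCH :query_vec AND k = :k          -- ANN lookup
)
SELECT d.source_type, d.source_id, c.content, c.heading_path,
       1.0 - (knn.distance * knn.distance) / 2.0 AS score
FROM knn
JOIN memory_chunks    AS c ON c.chunk_id = knn.chunk_id
JOIN memory_documents AS d ON d.id = c.document_id
WHERE d.user_id = :user_id
  AND d.deleted_at IS NULL                               -- tombstones filtered here
ORDER BY knn.distance
"""
```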
30.3 Embedding Layer
EmbeddingProvider is a Protocol (src/donna/memory/embeddings.py)
honoured by MiniLMProvider (default, wraps
capabilities/embeddings.py), the deterministic test fake, and any
future provider (bge-small, Voyage-3-lite). Selection is config-only
via embedding.provider. Every embed emits one invocation_log row
per input text (task_type in {embed_vault_chunk,
embed_memory_query, embed_chat_turn, embed_task, embed_correction},
model_alias="minilm-l6-v2", tokens_in=0, cost_usd=0.0) so
§4.3's invocation-log contract holds for local work the same way
it does for cloud calls. The episodic task_type values are added
in slice 14 when ChatSource, TaskSource, and CorrectionSource
start routing upserts through the store; MemoryStore looks up the
correct task_type from source_type so the provider logs each
batch distinctly without every source needing its own provider
instance.
The chunker (MarkdownHeadingChunker, 256-token cap / 32-token
overlap) walks markdown headings H1–H3, preserves heading_path
stacks, and keeps fenced code blocks intact when they fit. Token
counting uses tiktoken cl100k_base when available and a
deterministic word+punct heuristic when the encoding file cannot be
fetched — the fallback over-counts English prose, so the effective
cap stays below MiniLM's true 256-WordPiece limit.
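A sketch of the two-path token counter; the tiktoken path and the over-counting intent are from the text above, while the exact regex heuristic is an assumption:

```python
import re

try:
    import tiktoken
    _enc = tiktoken.get_encoding("cl100k_base")
except Exception:  # encoding file unavailable (e.g. offline host)
    _enc = None

# Counts each word and each punctuation mark as a token, which
# over-counts English prose relative to BPE, keeping the cap conservative.
_WORD_OR_PUNCT = re.compile(r"\w+|[^\w\s]")

def count_tokens(text: str) -> int:
    if _enc is not None:
        return len(_enc.encode(text))
    return len(_WORD_OR_PUNCT.findall(text))
```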
30.4 Ingestion
Slice 13 wires one source: the vault.
VaultSource.watch() tails watchfiles.awatch with a 500 ms
coalesce window and routes adds / modifies / deletes into
MemoryIngestQueue (batched upserts) or MemoryStore.delete
(soft-delete). VaultSource.backfill(user_id) walks the vault root
on boot, mtime-compares against memory_documents.updated_at, and
catches up anything newer-on-disk. Notes whose frontmatter contains
donna: local-only (or donna_sensitive: true) are flagged
sensitive; the flag propagates to every RetrievedChunk.metadata.
Slice 14 adds chat / task / correction sources using the same
MemoryStore — no schema changes are expected.
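A sketch of the watcher loop; `watchfiles`' debounce approximates the 500 ms coalesce window, and the queue/store shapes are assumed from the descriptions above:

```python
from watchfiles import awatch, Change

async def watch_vault(root: str, queue, store, user_id: str) -> None:
    # Each iteration yields a coalesced batch of (change, path) tuples.
    async for changes in awatch(root, debounce=500):
        for change, path in changes:
            if not path.endswith(".md"):
                continue
            if change is Change.deleted:
                await store.delete(user_id, source_type="vault", source_id=path)
            else:  # Change.added or Change.modified
                await queue.put(path)  # batched upsert via MemoryIngestQueue
```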
30.5 memory_search Tool
Agents retrieve via the memory_search skill tool
(src/donna/skills/tools/memory_search.py). Signature:
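(The signature itself is elided at this anchor; the sketch below infers parameter names from §30.2's retrieval config and the §30.8 call sites, so treat them as assumptions.)

```python
from __future__ import annotations

async def memory_search(
    query: str,
    sources: list[str] | None = None,  # e.g. ["vault", "chat", "task"]
    k: int = 8,                        # clamped to retrieval.max_k (§30.2)
) -> list[RetrievedChunk]:             # sub-min_score results already dropped
    ...
```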
Returns provenance-tagged chunks (source_type, source_path,
heading_path, score, sensitive, metadata). Registered by
donna.skills.tools.register_default_tools(memory_store=...) and
included in allowed_tools for pm, scheduler, research, and
challenger in config/agents.yaml (§7.1).
30.6 Observability
A Grafana dashboard at docker/grafana/dashboards/memory.json
shows retrieval latency p50/p95, ingest-batch count, re-embed
counter, watcher-event breakdown, and average hits per query.
Structlog events fire on every ingest (memory_ingest_batch), every
retrieval (memory_retrieval), every vault-watcher change
(vault_watch_event), and every backfill (vault_backfill_done).
30.7 Scope Guardrails
Shipped in later slices (historical notes):
- Slice 14 added chat / task / correction ingestion — `ChatSource`, `TaskSource`, `CorrectionSource` using the same `MemoryStore`, no schema changes. Full design in docs/reference-specs/memory-vault-spec.md §9.
- Slice 15 added template-driven vault writes (see §30.8). Jinja templates live under `prompts/vault/`; the shared orchestrator is `MemoryInformedWriter`; one reference trigger (`MeetingNoteSkill`) is wired end-to-end.
- Slice 16 shipped the four remaining template writes (`weekly_review`, `daily_reflection`, `person_profile`, `commitment_log`), a central `People/{name}.md` stub auto-creator wired into `MemoryInformedWriter`, and content-hash-based rename / move reconciliation in `VaultSource.watch()` (2 s TTL buffer; rename paths skip re-embedding). The writer-owned structlog events were renamed from `meeting_note_*` to `vault_autowrite_*` and now carry a `template` field. `AsyncCronScheduler` gained optional `day_of_week` / `minute_utc` kwargs for the weekly triggers; APScheduler is still not a project dependency (home-grown pollers remain the idiom).
Still deferred:
- Re-rendering autowritten notes when source data changes post-write (e.g., calendar event moves after the meeting note is written) → slice 17+.
- Supabase sync for `memory_documents` / `memory_chunks` and the `calendar_mirror.attendees` column → slice 17.
- BM25 / hybrid retrieval and eval harness → slice 17.
- Cloud embedding providers — the Protocol supports them; no wiring shipped.
- Attachment indexing (images, PDFs). V1 is `.md` only.
30.8 Template Writes (slice 15)
Slice 15 is the first slice where Donna writes out to the vault autonomously in response to triggers. Slices 12–14 were all inbound (ingestion + retrieval); slice 15 completes the read/write loop by producing scaffold notes the user then edits.
Shared infrastructure (reusable by slice 16's four remaining templates):
- `VaultTemplateRenderer` (`src/donna/memory/templates.py`) — `FileSystemLoader` + `StrictUndefined` Jinja environment. Templates are self-contained: each emits its own frontmatter as a first-line `---...---` YAML block, and the renderer parses it back out via `python-frontmatter` before returning `(body, frontmatter_dict)`. Missing context keys raise `jinja2.UndefinedError` at render time, so templates must declare every variable they consume.
- `MemoryInformedWriter` (`src/donna/memory/writer.py`) — the single orchestrator every template-write skill delegates to. Owns autonomy-level path redirection (see §7.3), a frontmatter-keyed idempotency short-circuit that runs before any LLM spend, the prompt-template load + render via `ModelRouter.get_prompt_template` + routed `router.complete`, the vault-template render, and the `VaultWriter.write` with `expected_mtime` pass-through. Any exception after the idempotency check emits `vault_autowrite_failed` and returns a skipped `WriteResult`; never a partial write.
- `resolve_person_link` (`src/donna/memory/linking.py`) — returns `[[People/{name}]]` if the file exists, else `[[{name}]]`. Never auto-creates stubs; unresolved links surface in Obsidian's "Unresolved links" panel as a nudge for the user.
Reference trigger (meeting note):
- `MeetingEndPoller` (`src/donna/capabilities/meeting_end_poller.py`) runs at `memory.skills.meeting_note.poll_interval_seconds`. Its SELECT excludes events that already have an indexed meeting note: `event_id NOT IN (SELECT json_extract(metadata_json, '$.calendar_event_id') FROM memory_documents WHERE source_type='vault' AND json_extract(metadata_json, '$.type')='meeting')`. `json_extract` is used over `->>` for broad SQLite version tolerance.
- `MeetingNoteSkill` (`src/donna/capabilities/meeting_note_skill.py`) composes context from three concurrent `memory_search` calls — prior meetings (`sources=["vault"]`, post-filtered on `metadata.type == 'meeting'`), recent chats mentioning any attendee, and open tasks tagged to any attendee — capped per `context_limits.{prior_meetings,recent_chats,open_tasks}`. Target path: `Meetings/{event.start:%Y-%m-%d}-{slug}.md`. Idempotency key: the calendar event id (a probe sketch follows this list).
- New task type `draft_meeting_note` (`config/task_types.yaml`, `config/donna_models.yaml`) routes to the reasoner model with structured output per `schemas/draft_meeting_note.json` (`summary`, `action_item_candidates`, `open_questions`, `links_suggested`).
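A sketch of the writer's idempotency probe, run before any LLM spend; the SQL mirrors the poller's exclusion filter above, and the aiosqlite-style connection shape plus the function name are assumptions:

```python
async def already_written(db, event_id: str) -> bool:
    # True if a vault meeting note carrying this calendar event id
    # is already indexed, in which case the writer skips the whole run.
    cursor = await db.execute(
        """
        SELECT 1 FROM memory_documents
        WHERE source_type = 'vault'
          AND json_extract(metadata_json, '$.type') = 'meeting'
          AND json_extract(metadata_json, '$.calendar_event_id') = ?
        LIMIT 1
        """,
        (event_id,),
    )
    return await cursor.fetchone() is not None
```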
Schema extension:
- Alembic migration `c9d1e3f5a7b2` adds a nullable `attendees TEXT` column to `calendar_mirror` (JSON-encoded `list[{name, email}]`). `calendar.py::_parse_event` reads `items[i].attendees` from the Google Calendar API payload (displayName preferred, email local-part fallback); `calendar_sync.py::_update_mirror` JSON-encodes on upsert.
Observability:
- Structlog events: `meeting_end_detected` (poller found an eligible event), `meeting_note_skipped_idempotent` (writer found a matching key), `meeting_note_written` (happy path), `vault_autowrite_failed` (any step raised).
- Every `draft_meeting_note` call logs to `invocation_log` per §4.3, with `task_type=draft_meeting_note`, `model_alias=reasoner`, and real `tokens_in` / `tokens_out` / `cost_usd` (unlike the local embedding calls in §30.3).
- The `memory` Grafana dashboard gains a "Template writes" row: writes-by-template timeseries, idempotent-skip rate, LLM-cost timeseries for `task_type=draft_meeting_note`, and autowrite failure count.
Design intent: the meeting note is a scaffold, not a pretend
transcript. Future audio-transcription work will replace the
LLM-drafted summary stub with real content; the scaffold nudges the
user to fill in decisions and action items, and the attendee
wikilinks + prior-meeting backlinks mean Donna's own writes flow
back into memory_search (closed loop: Donna's writes become
memory).
--- End of Specification ---