Model Abstraction & Evaluation Layer¶
Split from Donna Project Spec v3.0 — Section 4
Core Principle¶
The orchestrator and agents never call a specific model provider directly. All LLM interactions go through a standardized interface handling provider abstraction, structured logging, routing, and shadow evaluation.
Model Interface¶
Every model call goes through:
Metadata always includes: latency_ms, tokens_in, tokens_out, cost_usd, model_actual (resolved provider + model name), is_shadow.
Two implementations: AnthropicProvider (Claude API) and OllamaProvider (local LLM). A third can be added without changing calling code.
Source Files¶
src/donna/models/
├── __init__.py
├── router.py # ModelRouter — config-driven routing, escalation, overflow handling
├── types.py # CompletionMetadata dataclass
├── tokens.py # estimate_tokens() — len//4 baseline (gate floor); router calibrates a per-task EMA divisor for the Ollama budget gate
├── validation.py # validate_output() — JSON Schema (draft-07) validation
├── quality.py # Spot-check quality monitoring (Claude-as-judge)
└── providers/
├── __init__.py # ModelProvider protocol
├── anthropic.py # AnthropicProvider — Claude API
├── ollama.py # OllamaProvider — local LLM
└── _parsing.py # parse_json_response() — strip markdown fences from LLM output
ModelRouter (router.py)¶
The ModelRouter class is the central dispatch point. It loads routing configuration from donna_models.yaml and task_types.yaml, resolves task_type to model_alias to provider, and dispatches completions through the resilience layer (circuit breaker, retry).
Per-provider circuit breakers. Each provider name (ollama, anthropic, …) gets its own CircuitBreaker, created lazily via _breaker_for(provider). resilient_call is handed the breaker of the provider that will actually run the call — after an overflow escalation, that is the fallback's provider — so a local-GPU (Ollama) outage opens only Ollama's breaker and never short-circuits cloud (Anthropic) calls. When a breaker opens, the router dispatches a dispatch_fallback_alert in addition to the critical log.
Every billed attempt is logged. resilient_call retries the whole billed provider call; the router passes it an on_attempt_failure hook and an is_transient_error classifier so that (a) each billed attempt — including one whose response failed to parse — writes its own invocation_log row marked interrupted=1 (providers raise ResponseParseError carrying the pre-parse usage), and (b) non-retryable errors (auth / 4xx) fail fast instead of burning the retry budget; transport / 5xx / 408 / 429 / 529 retry. No billed call goes unlogged.
Key error types raised by ModelRouter.complete():
| Error | When Raised |
|---|---|
EscalationDecisionError |
Over-budget gate resolves to a terminal mode (pause, cancel, claude_code, chat). Carries mode, escalation_request_id, correlation_id. |
TokenLimitReachedError |
Provider truncated output at extension-derived token cap. Caller should re-estimate and re-offer escalation. |
ContextOverflowError |
Prompt exceeds local model budget (num_ctx - output_reserve) and no fallback configured. |
RoutingError |
Task type has no routing entry in config. |
CompletionMetadata (types.py)¶
Returned alongside every LLM completion:
| Field | Type | Purpose |
|---|---|---|
latency_ms |
int | Wall clock time |
tokens_in |
int | Input tokens consumed |
tokens_out |
int | Output tokens generated |
cost_usd |
float | Computed cost |
model_actual |
str | Resolved provider + model name |
is_shadow |
bool | Shadow run flag |
estimated_tokens_in |
int? | Pre-dispatch token estimate (for drift tracking) |
overflow_escalated |
bool | True if request fell back from local to cloud due to context overflow |
token_limited |
bool | True if provider truncated output at extension cap |
Token Estimation (tokens.py + self-calibrating divisor)¶
estimate_tokens(text) provides the zero-dependency len(text) // 4 baseline, still used for the deterministic escalation-gate cost floor.
For the Ollama context-budget gate, the router no longer trusts the constant 4: a constant divisor under-estimates dense prompts, which silently truncates the context window — the exact failure the loud ContextOverflowError exists to prevent. Instead the router keeps a per-task-type EMA of the observed len(prompt) / tokens_in ratio from completed Ollama calls and uses the clamped EMA (times safety_factor) as the divisor. Knobs live under ollama.token_estimation in config/donna_models.yaml:
| Knob | Meaning |
|---|---|
safety_factor |
Multiplier on the divisor; <1.0 shrinks it and inflates the estimate (headroom vs truncation). |
ema_alpha |
EMA smoothing in [0,1]; higher reacts faster to recent calls. |
divisor_bounds |
[low, high] clamp so one anomalous tokens_in can't poison the EMA. |
No tokenizer dependency is added — the estimator calibrates from observed traffic. Exact tokenization stays deferred (trigger: combined overflow + truncation_suspected rate >10%).
Output Validation (validation.py)¶
validate_output(data, schema) validates LLM JSON output against a JSON Schema (draft-07). On mismatch, raises ValidationError with a list of error messages. Used by agents and the orchestrator to enforce structured output contracts.
Response Parsing (providers/_parsing.py)¶
parse_json_response(text) extracts clean JSON from LLM text responses, stripping markdown code fences if present. Shared by both AnthropicProvider and OllamaProvider.
Routing Configuration¶
The routing table in config/donna_models.yaml maps model aliases to providers and defines per-task-type behavior. This is the primary control surface.
During Phase 1 (Claude API only), all aliases point to Anthropic. Switching a task type to local = change provider and model fields for the relevant alias.
The shadow key enables production monitoring (secondary model runs in parallel, output logged only, not used). Offline evaluation is triggered via CLI, not configured in routing.
Structured Invocation Logging¶
Every model call is logged to the invocation_log table. This is a hard
guarantee, not a convention: production routers are constructed via
build_model_router(), which requires an invocation_logger, and
ModelRouter.complete() raises RoutingError rather than make an unlogged
billed call. Because BudgetGuard computes spend from invocation_log, an
unlogged call would be invisible to the budget cap — so the choke point is the
accounting boundary, not just the dispatch boundary. (Bare ModelRouter(...)
construction is reserved for offline config-accessor paths — e.g. the eval CLI —
that never call complete().)
The logged cost_usd is computed from the per-alias config rates
(input/output_cost_per_token_usd), the single source of price truth; the
Anthropic provider fails loud on an unpriced model id rather than silently
applying Sonnet pricing.
| Field | Type | Purpose |
|---|---|---|
| id | UUID | Unique invocation ID |
| timestamp | DateTime | When the call was made |
| task_type | String | Which task type (parse, classify, generate, etc.) |
| task_id | UUID? | Associated task if applicable |
| model_alias | String | Config alias used (parser, reasoner, etc.) |
| model_actual | String | Resolved provider + model |
| input_hash | String | Hash of input for dedup and comparison matching |
| latency_ms | Int | Wall clock time |
| tokens_in | Int | Input tokens consumed |
| tokens_out | Int | Output tokens generated |
| cost_usd | Float | Computed cost ($0.00 for local before cost approx configured) |
| output | JSON | Actual structured response |
| quality_score | Float? | Filled by spot-check or offline eval |
| is_shadow | Boolean | Shadow run (production monitoring) or eval run |
| eval_session_id | UUID? | Groups invocations from a single eval run |
| spot_check_queued | Boolean | Queued for Claude-as-judge review |
| user_id | String | User who triggered the call |
| queue_wait_ms | Int? | Time spent in LLM queue before dispatch |
| interrupted | Boolean | True if the call was interrupted before completion |
| chain_id | String? | Groups related calls in a multi-step chain |
| caller | String? | Identifies the calling module/agent |
| estimated_tokens_in | Int? | Pre-dispatch token estimate (for drift tracking vs actual tokens_in) |
| overflow_escalated | Boolean | True if request fell back from local to cloud due to context overflow |
| skill_id | String? | Associated skill if the call was skill-triggered |
| escalation_request_id | Int? | Links to an escalation request when the call is budget-gated |
| payload_path | String(300)? | Filesystem path to full request/response JSON (written by PayloadWriter, evicted by PayloadEvictor) |
Shadow Mode (Production Monitoring)¶
Runs secondary model on same input without affecting primary output. Use case: after migrating task_parse from Claude to local, keep Claude as shadow for 2–4 weeks. If quality degrades, revert by changing routing config.
Routes through complete(). Shadow is no longer a side-channel that calls a provider directly. _run_shadow re-dispatches through self.complete(..., is_shadow=True) (resolving the shadow alias via _lookup_routing_entry + a synthetic <task_type>::__shadow__ route), so the shadow inherits the budget pre-check, the per-provider circuit breaker, and — critically — invocation_log accounting: shadow spend lands as an is_shadow=1 row with its real cost_usd. An is_shadow recursion guard ensures a shadow call never spawns its own shadow.
Kill switch. Shadow is gated by shadow.enabled in config/donna_models.yaml (default false). Because shadow doubles real billed spend, it stays off until an operator opts in; a configured routing.<task>.shadow alias is inert while the switch is off. The full statistical auto-disable job (design B) is deferred until shadow is enabled in production.
Cost implication: Doubles model cost for that task type — now visible in the ledger. Intended as temporary — disable once confidence is established.
Offline Evaluation Harness (Model Comparison)¶
Development tool for comparing models against the same test inputs. Triggered via CLI, not part of production routing. Primary purpose: model selection for local LLM.
Tiered Test Fixtures¶
fixtures/
├── parse_task/
│ ├── tier1_baseline.json # ~10 cases: simple, unambiguous
│ ├── tier2_nuance.json # ~15 cases: implicit deadlines, ambiguity
│ ├── tier3_complexity.json # ~10 cases: multi-part, dependencies
│ └── tier4_adversarial.json # ~5 cases: edge cases, contradictions
├── classify_priority/
├── generate_digest/
├── deduplication/
├── escalation_awareness/
│ ├── should_escalate.json
│ └── should_handle.json
└── instruction_following/
├── claude_decomposition.json
├── constraint_compliance.json
└── correction_application.json
Tier Definitions¶
| Tier | Name | Cases | Pass Gate |
|---|---|---|---|
| 1 | Baseline | ~10 | 90%+ to continue |
| 2 | Nuance | ~15 | 80%+ |
| 3 | Complexity | ~10 | 60%+ |
| 4 | Adversarial | ~5 | No gate — diagnostic only |
Sequential Evaluation¶
One GPU runs one model at a time. Harness runs sequentially: load model A → all fixtures → save results → swap to model B → repeat.
Evaluation Dimensions¶
Escalation Awareness — does the model know when NOT to try?
| Metric | Target | Rationale |
|---|---|---|
| Precision (correctly escalated) | 85%+ | Over-escalation wastes money but produces correct results |
| Recall (caught tasks it shouldn't handle) | 85%+ | Under-escalation produces garbage. Less tolerable. |
| False positive rate | < 25% | Above this, cost savings undermined |
Instruction Following — can the model execute Claude-generated directives?
| Metric | Target |
|---|---|
| Constraint compliance | 90%+ |
| Format adherence | 95%+ |
| Rule application accuracy | 85%+ |
| Rule false application | < 10% |
Spot-Check Quality Monitoring¶
Active when local LLM is handling traffic (quality_monitoring.enabled: true in config). Implemented in src/donna/models/quality.py.
spot_check_rate: 0.05(5% sampled, higher during early deployment)- Batch job sends sampled outputs to Claude-as-judge
- Scores below
flag_threshold: 0.7create a Donna task for user review - Corrections flow into correction log (see
docs/preferences.md)
Confidence Scoring¶
- Self-assessed (default): Include
confidencefield (0.0–1.0) in output schema. - Logprob-based (optional): Examine average token logprobs from Ollama API.
Start with self-assessed, correlate with actual accuracy, upgrade if unreliable.
Local Model Cost Approximation¶
models:
parser:
provider: ollama
model: llama3.1:8b-q4
estimated_cost_per_1k_tokens: 0.0001 # hardware amortization
Never show local inference as "free" — enables genuine cost-per-quality comparison.
Local LLM Context Window Strategy¶
Ollama defaults to a 2048-token window unless num_ctx is explicitly set. Donna configures it on every Ollama call via two knobs in config/donna_models.yaml:
ollama.default_num_ctx— the total window (prompt + output) for all Ollama aliases.ollama.default_output_reserve— tokens held aside for model output so the prompt budget never clips mid-generation.
Per-alias overrides live on the individual model entry: models.<alias>.num_ctx.
Pre-dispatch budgeting¶
Before dispatching to a local alias, ModelRouter estimates prompt tokens with the self-calibrating divisor (see Token Estimation above) and compares against num_ctx - output_reserve. If the estimate exceeds the budget:
- If the task type has a
fallbackconfigured, the call escalates to the cloud alias. Acontext_overflow_escalationwarn event is logged, andinvocation_log.overflow_escalatedis set to1. - If no fallback exists, the router raises
ContextOverflowError. This is deliberate — silent truncation produces silent garbage.
Post-call truncation tripwire¶
Estimation is necessarily approximate, so a prompt can still slip past the pre-dispatch gate and saturate the window. After every Ollama call the router checks tokens_in >= num_ctx - output_reserve. When that holds, the response was likely silently truncated — the failure the loud ContextOverflowError exists to prevent. The router logs truncation_suspected=True, dispatches a dispatch_fallback_alert, logs the (billed-but-suspect) local row as interrupted, and re-dispatches once to the configured fallback alias — or raises ContextOverflowError if none exists. This closes the under-estimate hole and adds truncation_suspected to the upgrade gauge so the exact-tokenization trigger can actually fire (it previously counted only detected overflows).
Every call to an Ollama alias records invocation_log.estimated_tokens_in alongside the actual tokens_in reported by Ollama, and feeds the observed ratio back into the per-task-type divisor EMA. The LLM Gateway dashboard surfaces mean absolute error as a gauge for when to upgrade the estimator to exact tokenization.
Future Enhancements¶
Documented in archive/2026-04-12-local-llm-context-strategy-design.md:
per-task-type compaction strategies, pgvector brain on Supabase, exact tokenization via Ollama, and per-alias daily caps on overflow escalations. Tracked as G-26 – G-29.
Cost Tracking Scope¶
Cost tracking and budget limits apply exclusively to LLM API costs (Claude API and local model cost approximations). External service costs (Twilio, Firebase, Supabase, Gmail API) have their own billing dashboards and are not tracked by Donna's cost system.
All LLM API costs are tracked via the invocation_log table. Budget limits are configurable in config/donna_models.yaml under the cost section:
- Monthly budget: Hard cap on total LLM spend.
- Daily pause threshold: Pauses autonomous agent work when reached.
- Per-task approval threshold: Requires user approval before executing expensive tasks.
- Monthly warning percentage: Alerts at configurable percentage of monthly budget.
- Per-model-alias limits (optional): Cap spend on specific model aliases (e.g., limit
reasonerseparately fromparser) to control expensive task type spend independently.