# Workflow: Run Evals
Realizes: spec_v3.md §4.5 Offline Evaluation Harness.
## Purpose

Regression-test prompt / model / skill changes against version-controlled fixtures. Fixtures are organized into tiers, and each tier has a numeric pass gate, so pass/fail is deterministic.
## Tier Definitions
| Tier | Intent | Pass Gate |
|---|---|---|
| tier1 | Baseline — straightforward cases | ≥ 0.90 |
| tier2 | Nuance — plausible ambiguity | ≥ 0.85 |
| tier3 | Complexity — multi-clause / chained | ≥ 0.75 |
| tier4 | Adversarial — ambiguous / contradictory | ≥ 0.60 |
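The gate check implied by the table above can be sketched in a few lines; the dict and function names here are illustrative, not part of the actual harness:

```python
# Pass gates per tier, taken from the table above.
PASS_GATES = {"tier1": 0.90, "tier2": 0.85, "tier3": 0.75, "tier4": 0.60}

def tier_passes(tier: str, pass_rate: float) -> bool:
    """Deterministic gate check: a tier passes iff its rate meets the gate."""
    return pass_rate >= PASS_GATES[tier]
```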
## Run It

```shell
# Single task type, single model
donna eval --task-type task_parse --model anthropic/claude-sonnet-4

# Compare models
donna eval --task-type task_parse --model ollama/qwen2.5:32b-instruct-q6_K
donna eval --task-type task_parse --model anthropic/claude-sonnet-4

# Specific tier
donna eval --task-type classify_priority --tier 3
```
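Sweeping the same task type across several models can be scripted on top of the CLI; a minimal sketch, assuming only the flags shown above (the helper name is hypothetical):

```python
import subprocess  # invocation left commented out below; shown for context

MODELS = [
    "anthropic/claude-sonnet-4",
    "ollama/qwen2.5:32b-instruct-q6_K",
]

def eval_command(task_type: str, model: str) -> list[str]:
    """Build the `donna eval` argument vector for one model."""
    return ["donna", "eval", "--task-type", task_type, "--model", model]

for model in MODELS:
    cmd = eval_command("task_parse", model)
    # subprocess.run(cmd, check=True)  # uncomment to actually run each eval
    print(" ".join(cmd))
```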
## What Gets Measured
Per fixture:
- Correctness — structured output matches expected
- Latency — p50/p95
- Cost — $ per invocation (from `invocation_log`)
- Token usage
Per tier:
- Pass rate vs gate
- Regression vs last green run
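The aggregates above can be sketched as follows, assuming latencies in milliseconds and boolean per-fixture pass results (helper names are illustrative):

```python
import statistics

def latency_percentiles(latencies_ms: list[float]) -> dict[str, float]:
    """Summarize per-fixture latency as p50/p95 over one run."""
    qs = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return {"p50": qs[49], "p95": qs[94]}

def pass_rate(results: list[bool]) -> float:
    """Fraction of fixtures that passed; compared against the tier's gate."""
    return sum(results) / len(results)
```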
## Where to Look
| Piece | Location |
|---|---|
| CLI | `donna.cli` |
| Fixtures | `fixtures/<task_type>/tierN/` |
| Invocation log | `donna_logs.db` — table `invocation_log` |
| Model aliases | `config/donna_models.yaml` |
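As an example of reading the invocation log, a sketch of an average-cost query against `donna_logs.db`; the column names `task_type`, `model`, and `cost_usd` are assumptions about the schema, not confirmed:

```python
import sqlite3

def avg_cost(db_path: str, task_type: str, model: str) -> float:
    """Mean $ per invocation for one (task_type, model) pair.

    Assumes invocation_log has task_type, model, and cost_usd columns;
    check the actual schema before relying on this.
    """
    with sqlite3.connect(db_path) as conn:
        row = conn.execute(
            "SELECT AVG(cost_usd) FROM invocation_log "
            "WHERE task_type = ? AND model = ?",
            (task_type, model),
        ).fetchone()
    return row[0]
```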