Workflow: Run Evals

Realizes: spec_v3.md §4.5 Offline Evaluation Harness.

Purpose

Regression-test prompt, model, and skill changes against version-controlled fixtures. Each fixture set is organized into tiers, and each tier has a numeric pass gate, so whether a run passes is decided by a fixed threshold rather than by judgment.

Tier Definitions

Tier    Intent                                   Pass Gate
tier1   Baseline: straightforward cases          ≥ 0.90
tier2   Nuance: plausible ambiguity              ≥ 0.85
tier3   Complexity: multi-clause / chained       ≥ 0.75
tier4   Adversarial: ambiguous / contradictory   ≥ 0.60
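
As a rough illustration of how the gates above could be applied, the sketch below compares a tier's pass rate with its threshold. The gate values come from the table; the function name and the passed/total inputs are assumptions for illustration, not the harness's actual API.

```python
# Pass gates copied from the tier table; everything else here is hypothetical.
PASS_GATES = {"tier1": 0.90, "tier2": 0.85, "tier3": 0.75, "tier4": 0.60}


def tier_passes(tier: str, passed: int, total: int) -> bool:
    """Return True when the tier's pass rate meets or exceeds its gate."""
    if total <= 0:
        raise ValueError("a tier needs at least one fixture")
    return passed / total >= PASS_GATES[tier]


if __name__ == "__main__":
    print(tier_passes("tier1", 18, 20))  # pass rate 0.90 against a 0.90 gate
    print(tier_passes("tier3", 7, 10))   # pass rate 0.70 against a 0.75 gate
```

Because the gate is an inclusive lower bound (≥), a tier that lands exactly on its threshold counts as passing.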

Run It

# Single task type, single model
donna eval --task-type task_parse --model anthropic/claude-sonnet-4

# Compare models
donna eval --task-type task_parse --model ollama/qwen2.5:32b-instruct-q6_K
donna eval --task-type task_parse --model anthropic/claude-sonnet-4

# Specific tier
donna eval --task-type classify_priority --tier 3
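
To compare several models in one go, the invocations above could be generated from a small script. This is only a sketch: it uses the three flags shown in this section (--task-type, --model, --tier), while the helper name and the MODELS list are assumptions made for illustration.

```python
import shlex

# Model identifiers taken from the examples above.
MODELS = ["ollama/qwen2.5:32b-instruct-q6_K", "anthropic/claude-sonnet-4"]


def build_eval_command(task_type, model=None, tier=None):
    """Build an argv list for `donna eval` (hypothetical helper)."""
    cmd = ["donna", "eval", "--task-type", task_type]
    if model is not None:
        cmd += ["--model", model]
    if tier is not None:
        cmd += ["--tier", str(tier)]
    return cmd


if __name__ == "__main__":
    for model in MODELS:
        # Print each command; pass the list to subprocess.run(cmd, check=True)
        # to actually execute it.
        print(shlex.join(build_eval_command("task_parse", model=model)))
```

Building an argument list rather than a single string avoids shell-quoting problems when a model name contains unusual characters.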

What Gets Measured

Per fixture:

  • Correctness — structured output matches expected
  • Latency — p50/p95
  • Cost — $ per invocation (from invocation_log)
  • Token usage

Per tier:

  • Pass rate vs gate
  • Regression vs last green run
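
The metrics above can be sketched as a small aggregation over per-fixture results. The record shape, the field names, the nearest-rank percentile method, and the definition of a regression as "pass rate below the last green run" are all assumptions for illustration; the harness may compute these differently.

```python
import math
from dataclasses import dataclass


@dataclass
class FixtureResult:
    """Hypothetical per-fixture record; field names are assumptions."""
    correct: bool
    latency_ms: float
    cost_usd: float
    tokens: int


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty sequence."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize_tier(results, gate, last_green_pass_rate):
    """Aggregate per-fixture results into tier-level metrics."""
    latencies = [r.latency_ms for r in results]
    pass_rate = sum(r.correct for r in results) / len(results)
    return {
        "pass_rate": pass_rate,
        "meets_gate": pass_rate >= gate,
        # Assumed definition: any drop below the last green run is a regression.
        "regressed": pass_rate < last_green_pass_rate,
        "latency_p50_ms": percentile(latencies, 50),
        "latency_p95_ms": percentile(latencies, 95),
        "total_cost_usd": sum(r.cost_usd for r in results),
        "total_tokens": sum(r.tokens for r in results),
    }
```

A tier can meet its gate and still be flagged as regressed, which is why the two checks are reported separately.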

Where to Look

Piece            Location
CLI              donna.cli
Fixtures         fixtures/<task_type>/tierN/
Invocation log   donna_logs.db (table invocation_log)
Model aliases    config/donna_models.yaml