LLM Gateway¶

The LLM gateway is a priority queue system that mediates all outbound model calls, managing GPU model state on the local RTX 3090, rate-limiting external callers, and alerting on operational anomalies.

Realizes: spec_v3.md §26

Overview¶

The LLM gateway (src/donna/llm/) sits between the model router and the actual Ollama/Claude backends. All local model calls funnel through a two-queue worker (LLMQueueWorker) where Donna's internal tasks always take priority over external API gateway requests. During active hours, running external requests can be preempted mid-inference to service internal work.

The gateway solves a specific hardware constraint: a single RTX 3090 running Ollama can only serve one model at a time. Loading a new model (e.g., switching from qwen2.5:32b-instruct-q6_K to a smaller model for a different task) requires unloading the current model and loading the new one, which takes 30 to 120 seconds. The GpuTracker monitors model swap frequency, latency, and overhead percentage, triggering alerts when swaps consume too much queue time.

The gateway also provides per-caller rate limiting via sliding windows, debounced Discord alerts for operational events (queue depth, rate limiting, starvation, budget), and a status endpoint that exposes the full queue state including GPU metrics. Configuration is hot-reloadable: changes to config/llm_gateway.yaml take effect without restarting the worker.

Key Concepts¶

| Concept | Description |
| --- | --- |
| LLMQueueWorker | Core worker loop. Maintains an internal `PriorityQueue` for Donna tasks and a standard `Queue` for external requests. Processes one item at a time, handling model swaps as needed. |
| Priority | Three levels: `CRITICAL` (0), `NORMAL` (1), `BACKGROUND` (2). Task types map to priorities via `config/llm_gateway.yaml`. Parse and challenge tasks are critical; nudges and reminders are background. |
| QueueItem | A single queued LLM call. Carries prompt, model, max tokens, JSON mode flag, an `asyncio.Future` for the result, priority, caller identity, and preemption state. |
| GpuTracker | Tracks which model is loaded on the GPU, rolling swap metrics (count, duration, overhead percentage), and alert thresholds. Single-worker access, not thread-safe. |
| RateLimiter | Per-caller sliding window limiter. Tracks request timestamps per caller with minute and hour windows. Counters survive hot-reload. |
| GatewayAlerter | Debounced Discord notifications for gateway events. Each (alert_type, caller) pair is debounced independently to prevent notification storms. |
| GatewayConfig | Parsed configuration dataclass. Covers scheduling (active hours), queue limits, rate limits, budget, cloud fallback, priority map, and GPU config. |
| Model Affinity | Within the same priority band, the worker prefers items that match the currently loaded GPU model, avoiding unnecessary swaps. |
| Preemption | During active hours, if an internal item arrives while an external request is running, the external request is cancelled, re-enqueued at the front of the external priority deque, and the internal item runs. |
| Home Model | The default GPU model (`qwen2.5:32b-instruct-q6_K`). After non-home work completes and no more non-home items are queued, the worker schedules a delayed restore to the home model. |

Architecture¶

flowchart TD
    subgraph Callers
        INT[Donna Internal<br/>ModelRouter]
        EXT[External API<br/>Gateway Callers]
    end

    INT --> IQ["Internal PriorityQueue<br/>(CRITICAL > NORMAL > BACKGROUND)"]
    EXT --> EQ[External Queue]

    subgraph Worker["LLMQueueWorker.run()"]
        POP[Pop next item] --> AFF{Model Affinity<br/>Check}
        AFF --> SWAP{Need Model<br/>Swap?}
        SWAP -->|Yes| MS[Swap Model via Ollama]
        MS --> EXEC[Execute LLM Call]
        SWAP -->|No| EXEC
        EXEC --> FUT[Resolve Future]
        FUT --> HOME{Queue empty?<br/>Non-home loaded?}
        HOME -->|Yes| HR[Schedule Home Restore]
        HOME -->|No| POP
    end

    IQ --> POP
    EQ --> POP

    subgraph Preemption
        IQ -->|Item arrives during<br/>active hours| PRE[Cancel external task]
        PRE --> EPQ["Re-enqueue at front<br/>of external priority deque"]
    end

    subgraph Monitoring
        RL[RateLimiter] --> EQ
        GT[GpuTracker] --> ALERT[GatewayAlerter]
        ALERT --> DISC[Discord #donna-debug]
    end

Queue Priority and Selection¶

The worker's _pop_next() method implements a three-tier selection:

1. Internal queue first. Items are popped from the asyncio.PriorityQueue, which sorts by (priority, sequence). Within the same priority band, model affinity breaks ties: items matching the currently loaded GPU model are preferred over items that would require a swap.
2. External priority deque. Interrupted/continuation items that were preempted are placed at the front of this deque and served before new external items.
3. External queue. Standard FIFO for new external requests.
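
The selection order can be illustrated with a minimal sketch. It is not the actual _pop_next() implementation: the `Item` dataclass, the `pop_next` function, and the use of a plain list for the internal queue (so that a priority band can be inspected directly) are assumptions made for illustration.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional


@dataclass
class Item:
    """Hypothetical stand-in for QueueItem with only the fields selection needs."""
    priority: int  # 0 = CRITICAL, 1 = NORMAL, 2 = BACKGROUND
    sequence: int
    model: str


def pop_next(internal: list, external_priority: deque, external: deque,
             loaded_model: Optional[str]) -> Optional[Item]:
    # Tier 1: internal work always wins over external work.
    if internal:
        top = min(item.priority for item in internal)
        band = [item for item in internal if item.priority == top]
        # Affinity: within the top band, items for the loaded model sort first,
        # then the lowest (oldest) sequence number.
        chosen = min(band, key=lambda i: (i.model != loaded_model, i.sequence))
        internal.remove(chosen)
        return chosen
    # Tier 2: external items that were preempted earlier.
    if external_priority:
        return external_priority.popleft()
    # Tier 3: new external requests, FIFO.
    if external:
        return external.popleft()
    return None
```

With "home" loaded and two NORMAL items queued, the one targeting "home" is returned first even if it has the higher sequence number; affinity never promotes an item across priority bands.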

Preemption Protocol¶

When an internal item arrives while the worker is processing an external request during active hours:

1. The current external asyncio.Task is cancelled.
2. The QueueItem is marked as interrupted and its interrupt_count is incremented.
3. If the interrupt count exceeds max_interrupt_count, a starvation alert fires.
4. The interrupted item is placed at the front of the external priority deque.
5. The internal item is processed next.
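
The cancel-and-requeue steps can be sketched with plain asyncio primitives. This is an illustrative sketch rather than the worker's code: `ExternalItem`, `slow_inference`, `preempt`, and `demo` are hypothetical names, and the starvation check is reduced to a boolean return value instead of an actual alert.

```python
import asyncio
from collections import deque
from dataclasses import dataclass


@dataclass
class ExternalItem:
    """Hypothetical stand-in for an external QueueItem."""
    sequence: int
    interrupt_count: int = 0


async def slow_inference(item: ExternalItem) -> str:
    await asyncio.sleep(10)  # stands in for a long-running model call
    return f"done-{item.sequence}"


async def preempt(task: asyncio.Task, item: ExternalItem,
                  external_priority: deque, max_interrupt_count: int) -> bool:
    """Cancel the running external task and requeue its item.

    Returns True when the interrupt count exceeds the limit, i.e. when a
    starvation alert would be due.
    """
    task.cancel()
    try:
        await task  # wait until the cancellation has actually been delivered
    except asyncio.CancelledError:
        pass
    item.interrupt_count += 1
    external_priority.appendleft(item)  # front of the deque: served before new items
    return item.interrupt_count > max_interrupt_count


async def demo() -> tuple:
    item = ExternalItem(sequence=7)
    external_priority = deque()
    task = asyncio.create_task(slow_inference(item))
    await asyncio.sleep(0)  # give the task a chance to start
    starved = await preempt(task, item, external_priority, max_interrupt_count=3)
    return task.cancelled(), item.interrupt_count, len(external_priority), starved
```

Running `asyncio.run(demo())` cancels the simulated inference after it has started, leaving the item at the front of the deque with an interrupt count of 1.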

GPU Model Management¶

The GpuTracker maintains state about the currently loaded Ollama model:

Swap recording. Each model swap records from-model, to-model, start time, and duration. The last 200 swaps are kept in a deque for rolling metrics.
Execution recording. Each LLM execution records its duration. The last 500 are kept for overhead calculation.
Overhead metric. swap_overhead_pct_1h is the share of the last hour's GPU time spent swapping: (total swap time) / (total swap time + total execution time), expressed as a percentage. High overhead means too many model switches.
Home restore. After non-home work completes, the worker checks if any remaining queued items require a non-home model. If not, it schedules a delayed restore (default 30s) to the home model.
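
As a worked illustration of the overhead metric, the sketch below computes it from two bounded deques. The function name, the (start_time, duration) tuple layout, and the scaling to a percentage are assumptions; only the deque sizes (200 swaps, 500 executions) and the one-hour window come from the description above.

```python
from collections import deque


def swap_overhead_pct(swaps: deque, executions: deque, now: float,
                      window_s: float = 3600.0) -> float:
    """Both deques hold (start_time_s, duration_s) tuples (assumed layout)."""
    cutoff = now - window_s
    # Only records that started inside the window count toward the metric.
    swap_time = sum(d for t, d in swaps if t >= cutoff)
    exec_time = sum(d for t, d in executions if t >= cutoff)
    total = swap_time + exec_time
    return 0.0 if total == 0 else 100.0 * swap_time / total


# Bounded histories, matching the sizes mentioned above.
swaps = deque([(0.0, 500.0), (7000.0, 60.0), (8000.0, 40.0)], maxlen=200)
executions = deque([(7500.0, 300.0)], maxlen=500)
overhead = swap_overhead_pct(swaps, executions, now=10000.0)
```

Here the swap recorded at t=0 falls outside the window, so 100 s of swapping against 300 s of execution gives an overhead of 25%, exactly at the default warning threshold.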

Alert Thresholds¶

| Alert | Trigger | Debounce |
| --- | --- | --- |
| Rate limited | Caller exceeds RPM or RPH | Per-caller, configurable |
| Queue depth | External queue exceeds `queue_depth_warning` (default 10) | Global |
| Queue full | External queue at `max_external_depth` (default 20) | Per-caller |
| Starvation | External request interrupted `max_interrupt_count`+ times | Per-caller |
| GPU swap frequency | More than `swaps_per_hour_warning` (default 4) swaps/hour | Per-alert-type |
| GPU swap latency | Last swap exceeded `swap_wait_ms_warning` (default 60000 ms, i.e. 60 s) | Per-alert-type |
| GPU swap overhead | Overhead exceeds `swap_overhead_pct_warning` (default 25%) | Per-alert-type |

All alerts route through GatewayAlerter, which sends to Discord #donna-debug with per-(type, caller) debouncing (default 10 minutes).
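
Per-key debouncing of this kind can be sketched as follows. The class `DebouncedAlerter`, its injectable clock, and the in-memory `sent` list are hypothetical; a real alerter would post to Discord where the sketch appends to a list.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class DebouncedAlerter:
    """Illustrative debouncer keyed on (alert_type, caller)."""

    def __init__(self, debounce_s: float = 600.0,
                 clock: Callable[[], float] = time.monotonic):
        self._debounce_s = debounce_s
        self._clock = clock
        self._last_sent: Dict[Tuple[str, Optional[str]], float] = {}
        self.sent: list = []

    def send_alert(self, alert_type: str, caller: Optional[str], message: str) -> bool:
        key = (alert_type, caller)
        now = self._clock()
        last = self._last_sent.get(key)
        # Suppress repeats of the same key inside the debounce window.
        if last is not None and now - last < self._debounce_s:
            return False
        self._last_sent[key] = now
        self.sent.append(message)  # placeholder for the Discord post
        return True
```

Because the key includes the caller, one noisy caller being rate limited does not suppress the first alert for a different caller; alerts without a caller can pass `None`.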

Rate Limiting¶

The RateLimiter implements per-caller sliding window rate limiting:

Minute window. Configurable RPM per caller (default 10). Timestamps older than 60s are pruned on each check.
Hour window. Configurable RPH per caller (default 100). Timestamps older than 3600s are pruned.
Startup rebuild. rebuild_from_records() creates synthetic timestamps from invocation_log call counts to restore approximate state after restart.
Hot-reload. update_limits() changes limits without clearing counters.
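
A minimal sliding-window limiter along these lines might look like the sketch below. It is an assumption-laden illustration, not the RateLimiter source: the class name, the single timestamp deque per caller, and the injectable clock are all hypothetical, and only `check()` and `update_limits()` mirror method names from this page.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Illustrative per-caller limiter with minute and hour windows."""

    def __init__(self, rpm: int = 10, rph: int = 100, clock=time.monotonic):
        self.rpm, self.rph = rpm, rph
        self._clock = clock
        self._stamps = defaultdict(deque)  # caller -> timestamps, oldest first

    def check(self, caller: str) -> bool:
        now = self._clock()
        stamps = self._stamps[caller]
        # Prune everything that has left the hour window.
        while stamps and now - stamps[0] >= 3600:
            stamps.popleft()
        in_last_minute = sum(1 for t in stamps if now - t < 60)
        if in_last_minute >= self.rpm or len(stamps) >= self.rph:
            return False  # rejected requests are not recorded
        stamps.append(now)
        return True

    def update_limits(self, rpm: int, rph: int) -> None:
        # Limits change in place; recorded timestamps are kept.
        self.rpm, self.rph = rpm, rph
```

Since `update_limits()` leaves the timestamp deques untouched, raising a limit takes effect on the next `check()` while earlier requests still count against the window.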

Configuration¶

Primary config: config/llm_gateway.yaml

scheduling:
  active_hours: "06:00-22:00"
  schedule_drain_minutes: 2

queue:
  max_external_depth: 20
  max_interrupt_count: 3

priority_map:
  parse_task: critical
  challenge_task: critical
  generate_digest: normal
  generate_nudge: background
  generate_reminder: background

rate_limits:
  default:
    requests_per_minute: 10
    requests_per_hour: 100
  callers: {}

budget:
  daily_external_usd: 5.00
  alert_pct: 80

gpu:
  home_model: "qwen2.5:32b-instruct-q6_K"
  swap_timeout_s: 120
  restore_home_delay_s: 30
  alerts:
    swaps_per_hour_warning: 4
    swap_wait_ms_warning: 60000
    swap_overhead_pct_warning: 25

The config is hot-reloadable via LLMQueueWorker.reload_config(). Queue contents and counters are preserved across reloads.
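
To show how two of these settings could be interpreted, here is a small sketch of active-hours and priority lookups. The free functions, the exclusive end of the window, the handling of windows that wrap past midnight, and the NORMAL default for unmapped task types are assumptions; the real GatewayConfig methods may behave differently.

```python
from datetime import time
from enum import IntEnum


class Priority(IntEnum):
    CRITICAL = 0
    NORMAL = 1
    BACKGROUND = 2


def is_active_hours(spec: str, now: time) -> bool:
    """Interpret a spec such as "06:00-22:00" (end treated as exclusive)."""
    start_s, end_s = spec.split("-")
    start, end = time.fromisoformat(start_s), time.fromisoformat(end_s)
    if start <= end:
        return start <= now < end
    # Window wraps past midnight, e.g. "22:00-06:00".
    return now >= start or now < end


def priority_for_task_type(priority_map: dict, task_type: str,
                           default: Priority = Priority.NORMAL) -> Priority:
    # Unmapped task types fall back to the (assumed) default priority.
    return Priority[priority_map.get(task_type, default.name).upper()]
```

For example, `priority_for_task_type({"parse_task": "critical"}, "parse_task")` yields `Priority.CRITICAL`, and `is_active_hours("06:00-22:00", time(22, 0))` is false under the exclusive-end assumption.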

API¶

| Class / Function | Module | Description |
| --- | --- | --- |
| `LLMQueueWorker` | `queue.py` | `enqueue_internal()`, `enqueue_external()` (return `asyncio.Future`), `process_one()`, `run()`, `stop()`, `get_status()`, `get_item()`, `reload_config()`, `preempt_external()`. |
| `GpuTracker` | `gpu_tracker.py` | `record_loaded()`, `record_swap_started()`, `record_swap_completed()`, `record_execution_time()`, `get_metrics()`, `check_alerts()`. Properties: `loaded_model`, `is_home`, `home_model`, `swaps_this_hour`. |
| `RateLimiter` | `rate_limiter.py` | `check(caller)` (returns bool), `get_usage(caller)`, `get_all_usage()`, `rebuild_from_records()`, `update_limits()`. |
| `GatewayAlerter` | `alerter.py` | `alert_rate_limited()`, `alert_queue_depth()`, `alert_queue_full()`, `alert_starvation()`, `alert_budget()`, `send_alert()`. |
| `GatewayConfig` | `types.py` | `priority_for_task_type()`, `is_active_hours()`, `rpm_for_caller()`, `rph_for_caller()`. |
| `load_gateway_config` | `types.py` | Parses `config/llm_gateway.yaml` into `GatewayConfig`. |
| `Priority` | `types.py` | IntEnum: `CRITICAL(0)`, `NORMAL(1)`, `BACKGROUND(2)`. |
| `QueueItem` | `types.py` | Dataclass with `__lt__` for PriorityQueue ordering. |
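
The role of `__lt__` in queue ordering can be seen in a trimmed-down sketch. The `QueueItem` below is a hypothetical reduction carrying only a few fields, and `drain_in_order` is an illustrative helper; the real dataclass in types.py holds more state (future, caller, preemption data).

```python
import asyncio
from dataclasses import dataclass
from enum import IntEnum


class Priority(IntEnum):
    CRITICAL = 0
    NORMAL = 1
    BACKGROUND = 2


@dataclass
class QueueItem:
    """Hypothetical, reduced queue item for illustration only."""
    sequence: int
    priority: Priority
    model: str
    prompt: str = ""

    def __lt__(self, other: "QueueItem") -> bool:
        # Lower priority value wins; sequence keeps FIFO order within a band.
        return (self.priority, self.sequence) < (other.priority, other.sequence)


async def drain_in_order(items: list) -> list:
    """Push items into an asyncio.PriorityQueue and return the pop order."""
    queue = asyncio.PriorityQueue()
    for item in items:
        queue.put_nowait(item)
    order = []
    while not queue.empty():
        order.append((await queue.get()).sequence)
    return order
```

Items enqueued as BACKGROUND, CRITICAL, NORMAL, CRITICAL (sequences 1 to 4) come back as 2, 4, 3, 1: both critical items first in arrival order, then normal, then background.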

Status Endpoint¶

get_status() returns a dict suitable for the /llm/queue/status API:

{
  "current_request": { "sequence": 42, "type": "internal", "model": "...", ... },
  "internal_queue": { "pending": 2, "next_items": [...] },
  "external_queue": { "pending": 5, "next_items": [...] },
  "stats_24h": { "internal_completed": 150, "external_completed": 30, "external_interrupted": 2 },
  "rate_limits": { "caller_a": { "minute": "3/10", "hour": "45/100" } },
  "mode": "active",
  "gpu": { "loaded_model": "qwen2.5:32b-instruct-q6_K", "is_home": true, "swaps_this_hour": 1, ... }
}