You feed it a ticket. It picks a category, drafts a reply, and decides whether a human should pick it up. Three Sonnet calls. Tool-use JSON so the schema can't drift. Per-call dollar cost. An eval set runs on every PR and blocks the merge if quality drops.
Cost / ticket: $0.008 (3 Sonnet calls)
P50 latency: 2.10s (end-to-end)
Tests passing: 73 / 73 (pytest · CI on every PR)
Branch coverage: 92.95% (gate at 75%)
A run is three Sonnet calls. The output comes back as three panels (classify, draft, escalate), each with a confidence bar and any FAQ citations the drafter pulled.
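Those three panels map one-to-one onto three response models. A minimal sketch of what the shared schema could look like; class and field names here are illustrative stand-ins, not the project's actual models:

```python
from pydantic import BaseModel, Field

class Classification(BaseModel):
    category: str                      # e.g. "billing", "bug", "how-to"
    confidence: float = Field(ge=0.0, le=1.0)
    rationale: str                     # the model has to argue for its confidence

class Draft(BaseModel):
    reply: str
    citations: list[str] = []          # FAQ chunk ids; empty when no FAQ matched

class Escalation(BaseModel):
    escalate: bool
    reason: str
```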
tool-use
Every LLM call goes through Anthropic's tool-use API and is validated by Pydantic. No JSON parse failures. One place owns the schema.
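A minimal sketch of one forced call, assuming the Anthropic Python SDK and a Pydantic v2 model; the tool name, prompt, and model id are placeholders:

```python
import anthropic
from pydantic import BaseModel, Field

class Classification(BaseModel):  # same shape as the sketch above
    category: str
    confidence: float = Field(ge=0.0, le=1.0)
    rationale: str

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def classify(ticket: str) -> Classification:
    resp = client.messages.create(
        model="claude-3-5-sonnet-latest",  # placeholder; pin a dated snapshot
        max_tokens=1024,
        tools=[{
            "name": "record_classification",
            "description": "Record the ticket's category with a justified confidence.",
            "input_schema": Classification.model_json_schema(),
        }],
        # tool_choice forces this exact tool, so the only possible output
        # is arguments matching the schema -- no free-text JSON to parse.
        tool_choice={"type": "tool", "name": "record_classification"},
        messages=[{"role": "user", "content": ticket}],
    )
    block = next(b for b in resp.content if b.type == "tool_use")
    return Classification.model_validate(block.input)  # Pydantic owns the schema
```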
calibration
The classifier is forced to justify its confidence. The eval harness measures calibration against ground truth on every PR.
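One common way to score that on a golden set is expected calibration error; this sketch is an assumption about the harness, not its actual code:

```python
# Bucket predictions by stated confidence, then compare each bucket's mean
# confidence to its empirical accuracy; the weighted gap is the ECE.
def expected_calibration_error(
    confidences: list[float], correct: list[bool], bins: int = 10
) -> float:
    n = len(confidences)
    ece = 0.0
    for b in range(bins):
        lo, hi = b / bins, (b + 1) / bins
        in_bin = [
            i for i, c in enumerate(confidences)
            if lo <= c < hi or (b == bins - 1 and c == 1.0)  # 1.0 joins the top bin
        ]
        if not in_bin:
            continue
        avg_conf = sum(confidences[i] for i in in_bin) / len(in_bin)
        accuracy = sum(1 for i in in_bin if correct[i]) / len(in_bin)
        ece += (len(in_bin) / n) * abs(avg_conf - accuracy)
    return ece
```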
rag
Drafter retrieves top-k chunks via pgvector cosine over Voyage-3 embeddings, then cites them inline. No FAQ → no citations.
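A sketch of the retrieval step, assuming psycopg and a `faq_chunks` table with a pgvector `embedding` column (both names are made up here); `<=>` is pgvector's cosine-distance operator, so `1 - distance` is cosine similarity:

```python
import psycopg
import voyageai

vo = voyageai.Client()  # reads VOYAGE_API_KEY from the environment

def top_k_chunks(question: str, k: int = 4) -> list[tuple[int, str, float]]:
    # Embed the query with the same model the chunks were embedded with.
    qvec = vo.embed([question], model="voyage-3", input_type="query").embeddings[0]
    qlit = "[" + ",".join(map(str, qvec)) + "]"  # pgvector input literal
    with psycopg.connect() as conn:  # connection params come from PG* env vars
        return conn.execute(
            """
            SELECT id, content, 1 - (embedding <=> %s::vector) AS cosine_sim
            FROM faq_chunks
            ORDER BY embedding <=> %s::vector
            LIMIT %s
            """,
            (qlit, qlit, k),
        ).fetchall()
```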
cost
Tokens come from `usage.input_tokens` / `usage.output_tokens`, not estimated. Per-call costs roll up into a total per ticket.
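Roughly what that rollup looks like; the per-million-token prices below are assumptions, not current Sonnet pricing:

```python
# Prices in $ per million tokens; assumed values, check the current rate card.
INPUT_PER_MTOK = 3.00
OUTPUT_PER_MTOK = 15.00

def call_cost(usage) -> float:
    """`usage` is the `response.usage` object the Anthropic SDK returns."""
    return (
        usage.input_tokens / 1_000_000 * INPUT_PER_MTOK
        + usage.output_tokens / 1_000_000 * OUTPUT_PER_MTOK
    )

def ticket_cost(usages) -> float:
    return sum(call_cost(u) for u in usages)  # three calls roll up per ticket
```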
ci
50-row golden set runs on every PR. >5% regression on any metric blocks merge. Sticky PR comment posts the diff.
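The gate reduces to a comparison against a checked-in baseline. A sketch under the assumption that metrics are higher-is-better and stored as JSON (paths and names are illustrative):

```python
import json
import sys

def gate(baseline_path: str, current_path: str, tolerance: float = 0.05) -> None:
    with open(baseline_path) as f:
        baseline = json.load(f)
    with open(current_path) as f:
        current = json.load(f)
    regressions = [
        f"{name}: {base:.3f} -> {current[name]:.3f}"
        for name, base in baseline.items()
        if base > 0 and (base - current[name]) / base > tolerance
    ]
    if regressions:
        print("eval regression gate failed:\n" + "\n".join(regressions))
        sys.exit(1)  # non-zero exit fails the PR check

if __name__ == "__main__":
    gate("evals/baseline.json", "evals/current.json")
```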
production
mypy --strict, ruff. 75% coverage gate. Langfuse tracing. Alembic migrations. Modal deploy. `make ci` runs clean on a fresh clone.