inbox-agent · Inbox Agent / Operator
§00 Mission · agentic customer support · v0.1.0

An AI support agent
with the wiring you'd expect.

You feed it a ticket. It picks a category, drafts a reply, and decides whether a human should pick it up. Three Sonnet calls. Tool-use JSON so the schema can't drift. Per-call dollar cost. An eval set runs on every PR and blocks the merge if quality drops.
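The three-call flow can be sketched as plain sequencing. A minimal sketch, assuming an injected `call(stage, prompt)` wrapper that performs one model call and returns that stage's structured output; the field names and stage payloads here are illustrative, not the project's actual interface:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class TicketResult:
    category: str
    confidence: float
    draft: str
    escalate: bool

def run_ticket(ticket: str, call: Callable[[str, str], dict]) -> TicketResult:
    """Run the three-stage pipeline: classify, draft, escalate."""
    c = call("classify", ticket)                      # call 1: category + confidence
    d = call("draft", f"[{c['category']}] {ticket}")  # call 2: reply draft (+ citations)
    e = call("escalate", d["draft"])                  # call 3: human hand-off decision
    return TicketResult(
        category=c["category"],
        confidence=c["confidence"],
        draft=d["draft"],
        escalate=e["escalate"],
    )
```

Injecting `call` keeps the orchestration testable with a fake model, which is how a golden-set eval can run without burning tokens.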

Pipeline · stages: 01 CLASSIFY → 02 DRAFT → 03 ESCALATE

Cost / ticket: $0.008 · 3 Sonnet calls

P50 latency: 2.10s · end-to-end

Tests passing: 73 / 73 · pytest, CI on every PR

Branch coverage: 92.95% · gate at 75%

§01 Live demo · paste · run · trace

Paste one of these or your own.

A run is three Sonnet calls. Output below comes back as three panels (classify, draft, escalate) with confidence bars and any FAQ citations the drafter pulled.

Preset cases · 5 loaded


§02 How it's wired · six pieces

Six pieces of plumbing that keep this thing honest in CI.

01 · tool-use · Tool-forced JSON

Every LLM call goes through Anthropic's tool-use API and is validated by Pydantic. No JSON parse failures. One place owns the schema.
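A minimal sketch of the pattern: one Pydantic model owns the schema, its JSON Schema becomes the tool's `input_schema`, and the tool-use block's `input` is validated on the way back. The tool name and category set here are made up for illustration; the real schema lives in the project.

```python
from pydantic import BaseModel, Field, ValidationError

class Classification(BaseModel):
    """Single owner of the classifier's output schema."""
    category: str = Field(pattern="^(billing|bug|account|shipping|other)$")
    confidence: float = Field(ge=0.0, le=1.0)

# Tool definition passed to messages.create(tools=[...]). Forcing it with
# tool_choice={"type": "tool", "name": "classify_ticket"} makes the model
# return arguments for this tool instead of free-form text.
classify_tool = {
    "name": "classify_ticket",
    "description": "Classify a support ticket.",
    "input_schema": Classification.model_json_schema(),
}

# Validate the tool_use block's `input` dict; this replaces JSON parsing.
raw = {"category": "billing", "confidence": 0.87}  # stands in for block.input
result = Classification.model_validate(raw)
```

Because the model emits tool arguments rather than prose containing JSON, there is nothing to parse, and a bad value (say, `confidence: 2.0`) fails validation in one known place instead of drifting downstream.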

02 · calibration · Calibrated confidence

The classifier reports a confidence score with every label, and the eval harness measures calibration against ground truth on every PR, so overconfident predictions surface as regressions.
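One standard way to measure that calibration is expected calibration error (ECE): bucket predictions by confidence and compare each bucket's mean confidence to its accuracy. A dependency-free sketch, not necessarily the harness's exact metric:

```python
def expected_calibration_error(preds, n_bins=10):
    """preds: list of (confidence, correct) pairs.

    ECE is the bucket-size-weighted sum of |mean confidence - accuracy|
    over confidence buckets; 0.0 means perfectly calibrated.
    """
    bins = [[] for _ in range(n_bins)]
    for conf, correct in preds:
        idx = min(int(conf * n_bins), n_bins - 1)  # clamp conf == 1.0
        bins[idx].append((conf, correct))
    ece, total = 0.0, len(preds)
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(ok for _, ok in bucket) / len(bucket)
        ece += (len(bucket) / total) * abs(avg_conf - accuracy)
    return ece
```

A classifier that says 0.9 and is right 90% of the time scores near zero; one that says 0.9 and is right 50% of the time scores around 0.4, which is exactly the kind of number a PR gate can compare against a baseline.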

03 · rag · FAQ retrieval, not invention

Drafter retrieves top-k chunks via pgvector cosine over Voyage-3 embeddings, then cites them inline. No FAQ → no citations.
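An in-memory sketch of what that query computes. In production the ranking happens inside Postgres (pgvector's `<=>` operator is cosine distance, i.e. 1 minus cosine similarity, so `ORDER BY embedding <=> :query LIMIT :k` returns the nearest chunks); the toy 2-d vectors below just stand in for real Voyage-3 embeddings.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def top_k(query_vec, chunks, k=3):
    """chunks: list of (text, embedding). Return the k most similar chunks,
    mirroring the pgvector ORDER BY ... LIMIT k server-side query."""
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, c[1]), reverse=True)
    return ranked[:k]
```

The "no FAQ → no citations" behavior falls out naturally: if retrieval returns nothing relevant, the drafter has no chunk IDs to cite, so invented citations can be flagged mechanically.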

04 · cost · Real cost accounting

Tokens come from `usage.input_tokens` / `usage.output_tokens`, not estimated. Per-call costs roll up into a total per ticket.
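The arithmetic is a few lines. The $3 / $15 per million input / output tokens used here are assumed Sonnet rates for illustration; check current pricing before relying on them.

```python
# Assumed pricing, USD per million tokens (illustrative; verify current rates).
PRICE_PER_MTOK = {"input": 3.00, "output": 15.00}

def call_cost(usage) -> float:
    """Price one call from the API's reported usage counts, never an
    estimate. `usage` carries input_tokens / output_tokens as returned
    in the response's usage block."""
    return (usage["input_tokens"] * PRICE_PER_MTOK["input"]
            + usage["output_tokens"] * PRICE_PER_MTOK["output"]) / 1_000_000

def ticket_cost(usages) -> float:
    """Roll the per-call costs up into one total per ticket."""
    return sum(call_cost(u) for u in usages)
```

Summing reported token counts per call, then per ticket, is what makes a number like "$0.008 / ticket" auditable rather than a guess.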

05 · ci · Eval gate in CI

A 50-row golden set runs on every PR. A >5% regression on any metric blocks the merge. A sticky PR comment posts the diff.
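The merge-blocking check reduces to a relative comparison per metric. A sketch under the assumption that every metric is higher-is-better; the real harness's metric definitions aren't shown here.

```python
def regressions(baseline: dict, current: dict, tolerance: float = 0.05) -> dict:
    """Return {metric: (baseline, current)} for every metric that dropped
    by more than `tolerance` (5%) relative to its baseline value."""
    failed = {}
    for name, base in baseline.items():
        cur = current.get(name, 0.0)
        if base > 0 and (base - cur) / base > tolerance:
            failed[name] = (base, cur)
    return failed

def gate(baseline: dict, current: dict) -> bool:
    """True means the merge is allowed; any regression blocks it."""
    return not regressions(baseline, current)
```

Using a relative threshold means a metric sitting at 0.95 and one at 0.60 get the same proportional slack, and the `regressions` dict is exactly the payload a sticky PR comment would render as a diff.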

06 · production · Production engineering

mypy --strict, ruff. 75% coverage gate. Langfuse tracing. Alembic migrations. Modal deploy. `make ci` runs clean on a fresh clone.