Cheap-then-deep routing: Haiku 4.5 first, Opus 4.7 only when it earns it
Route every call to Haiku 4.5 first. Ask it one thing: can you answer this with high confidence? If yes, ship its answer. If no, replay the same request to Opus 4.7. You pay premium tokens only on the hard 10–20%. Net cost drops ~70% with no quality loss on easy traffic.
On this page
Two-tier routing. Cheap model triages. Expensive model only on escalation. Most traffic is easy — stop paying Opus rates for it.
Run this as a gate before any expensive call:
You are a triage router. Read the user request below.
Answer ONLY with a JSON object: {"confidence": 0-1, "answer": string|null}.
Rules:
- If you can fully answer with high confidence, set answer to your response and confidence >= 0.8.
- If the request is ambiguous, multi-step, or needs deep reasoning, set answer to null and confidence < 0.5.
- Never guess. Low confidence is cheaper than a wrong answer.
Request:
{{user_input}}
Wire it: if confidence >= 0.8, ship answer. Else replay the original request to Opus 4.7. One branch. No orchestration framework, no router model zoo.
Receipt
- Triage model: Haiku 4.5 (
claude-haiku-4-5) - Escalation model: Opus 4.7 (
claude-opus-4-7) - Triage cost: ~320 in + 40 out tokens ≈ $0.0004
- Blended cost across mixed traffic: ~$0.004 / call (≈70% cheaper than all-Opus)
Why it works
- Easy traffic dominates. 80% of calls never need a frontier model — Haiku closes them for a tenth of a cent.
- Confidence is self-reported but calibrated. Forcing a 0–1 score makes the model commit instead of bluffing.
- One escalation hop. No embeddings, no classifier to retrain, no second system to keep in sync.
Failure mode
- Overconfident Haiku. It occasionally rates a wrong answer 0.9. Sample 1% of shipped answers and grade them; if the false-confident rate climbs, raise the threshold to 0.9.
- Escalation storms. A bad input class can push everything to Opus. Alert when the escalation rate crosses 35%.
Cost to test: $0.004 / call.
Sources
- Anthropic model card: Haiku 4.5 latency and price.
- Internal A/B: 12k production calls, mixed support traffic.
FAQ
What confidence threshold should I start with?
0.8. Strict enough to keep wrong answers from shipping, loose enough that Haiku still closes most easy calls. Tune from there using a sampled grade.
Does this add latency?
On easy calls, no — Haiku is faster than Opus, so you often beat a single Opus call. Escalated calls pay one extra Haiku round-trip, roughly 200ms.
Can I add Sonnet as a middle tier?
Yes. Haiku → Sonnet → Opus is the three-tier variant. Add it only if your escalation rate is high and Sonnet closes most of it cheaper than Opus.