Sam Q., 1 min read

Cheap-then-deep routing: Haiku 4.5 first, Opus 4.7 only when it earns it

Route every call to Haiku 4.5 first. Ask it one thing: can you answer this with high confidence? If yes, ship its answer. If no, replay the same request to Opus 4.7. You pay premium tokens only on the hard 10–20% of calls. Net cost drops about 70% versus sending everything to Opus, with no quality loss on easy traffic.


Two-tier routing. Cheap model triages. Expensive model only on escalation. Most traffic is easy — stop paying Opus rates for it.

Run this as a gate before any expensive call:

You are a triage router. Read the user request below.
Answer ONLY with a JSON object: {"confidence": 0-1, "answer": string|null}.

Rules:
- If you can fully answer with high confidence, set answer to your response and confidence >= 0.8.
- If the request is ambiguous, multi-step, or needs deep reasoning, set answer to null and confidence < 0.5.
- Never guess. Low confidence is cheaper than a wrong answer.

Request:
{{user_input}}

Wire it: if confidence >= 0.8, ship answer. Else replay the original request to Opus 4.7. One branch. No orchestration framework, no router model zoo.
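
A minimal sketch of that branch in Python. Assumptions: `call_model(model, prompt)` is a hypothetical helper you supply that wraps whatever client you use and returns the model's text; `route` and `TRIAGE_TEMPLATE` are illustrative names, not part of any SDK. Treating unparseable triage output as low confidence is one reasonable choice, not something the recipe prescribes.

```python
import json
from typing import Callable, Tuple

TRIAGE_MODEL = "claude-haiku-4-5"
ESCALATION_MODEL = "claude-opus-4-7"
THRESHOLD = 0.8

# The triage prompt from the article, with {{user_input}} as the slot.
TRIAGE_TEMPLATE = """You are a triage router. Read the user request below.
Answer ONLY with a JSON object: {"confidence": 0-1, "answer": string|null}.

Rules:
- If you can fully answer with high confidence, set answer to your response and confidence >= 0.8.
- If the request is ambiguous, multi-step, or needs deep reasoning, set answer to null and confidence < 0.5.
- Never guess. Low confidence is cheaper than a wrong answer.

Request:
{{user_input}}"""


def route(user_input: str, call_model: Callable[[str, str], str]) -> Tuple[str, str]:
    """Return (model_used, answer). call_model is a hypothetical client wrapper."""
    # str.replace instead of str.format: the template contains literal JSON braces.
    prompt = TRIAGE_TEMPLATE.replace("{{user_input}}", user_input)
    raw = call_model(TRIAGE_MODEL, prompt)
    try:
        verdict = json.loads(raw)
        confidence = float(verdict["confidence"])
        answer = verdict["answer"]
    except (ValueError, KeyError, TypeError):
        # Malformed triage output: treat as "not confident" and escalate.
        confidence, answer = 0.0, None

    if confidence >= THRESHOLD and isinstance(answer, str):
        return TRIAGE_MODEL, answer
    # Replay the ORIGINAL request, not the triage prompt.
    return ESCALATION_MODEL, call_model(ESCALATION_MODEL, user_input)
```

Passing `call_model` in as an argument keeps the branch testable with a stub and keeps the sketch independent of any particular client library.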

Receipt

  • Triage model: Haiku 4.5 (claude-haiku-4-5)
  • Escalation model: Opus 4.7 (claude-opus-4-7)
  • Triage cost: ~320 in + 40 out tokens ≈ $0.0004
  • Blended cost across mixed traffic: ~$0.004 / call (≈70% cheaper than all-Opus)
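
The blended figure follows from simple arithmetic: every call pays the triage fee, and only escalated calls also pay the Opus fee. A rough sketch for plugging in your own numbers. Assumptions: the function names are illustrative; the all-Opus per-call cost is not given in the receipt, so `OPUS_COST` below is back-derived from "$0.004 is ≈70% cheaper" (0.004 / 0.3) and should be replaced with your measured value.

```python
def blended_cost(triage_cost: float, escalation_cost: float, escalation_rate: float) -> float:
    """Expected cost per call: triage on every call, big model on the escalated share."""
    if not 0.0 <= escalation_rate <= 1.0:
        raise ValueError("escalation_rate must be between 0 and 1")
    return triage_cost + escalation_rate * escalation_cost


def savings_vs_all_escalation(triage_cost: float, escalation_cost: float, escalation_rate: float) -> float:
    """Fraction saved compared with sending every call straight to the big model."""
    blended = blended_cost(triage_cost, escalation_cost, escalation_rate)
    return 1.0 - blended / escalation_cost


TRIAGE_COST = 0.0004  # from the receipt
OPUS_COST = 0.0133    # ASSUMPTION: back-derived, not stated in the receipt

if __name__ == "__main__":
    for rate in (0.10, 0.20, 0.35):
        cost = blended_cost(TRIAGE_COST, OPUS_COST, rate)
        saved = savings_vs_all_escalation(TRIAGE_COST, OPUS_COST, rate)
        print(f"{rate:.0%} escalated: ${cost:.4f}/call, {saved:.0%} cheaper than all-Opus")
```

The same formula shows why the escalation rate is the number to watch: savings shrink roughly linearly as it rises.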

Why it works

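  • Easy traffic dominates. Roughly 80–90% of calls never need a frontier model; Haiku closes them for well under a tenth of a cent.
  • Confidence is self-reported but usable as a gate. Forcing a 0–1 score makes the model commit instead of bluffing, though the score still needs spot-checking (see the overconfidence failure below).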
  • One escalation hop. No embeddings, no classifier to retrain, no second system to keep in sync.

Failure modes

  • Overconfident Haiku. It occasionally rates a wrong answer 0.9. Sample 1% of shipped answers and grade them; if the false-confident rate climbs, raise the threshold to 0.9.
  • Escalation storms. A bad input class can push everything to Opus. Alert when the escalation rate crosses 35%.
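
Both guards fit in a small in-process monitor. A sketch, assuming a rolling window of recent calls; `RouterMonitor`, the window size, and the method names are hypothetical, while the 1% sample rate and 35% alert level are the ones named above. Grading the sampled answers is left to whatever review process you already have.

```python
import random
from collections import deque
from typing import Optional


class RouterMonitor:
    """Tracks escalation rate over a rolling window and samples shipped answers."""

    def __init__(self, window: int = 1000, alert_rate: float = 0.35,
                 sample_rate: float = 0.01, rng: Optional[random.Random] = None):
        self.outcomes = deque(maxlen=window)  # True means the call escalated
        self.alert_rate = alert_rate
        self.sample_rate = sample_rate
        self.rng = rng or random.Random()

    def record(self, escalated: bool) -> bool:
        """Record one call. Returns True if this shipped answer should be graded."""
        self.outcomes.append(escalated)
        # Only answers shipped by the cheap model are candidates for grading.
        return (not escalated) and self.rng.random() < self.sample_rate

    def escalation_rate(self) -> float:
        if not self.outcomes:
            return 0.0
        return sum(self.outcomes) / len(self.outcomes)

    def should_alert(self) -> bool:
        """True when the rolling escalation rate is above the alert level."""
        return self.escalation_rate() > self.alert_rate
```

Call `record(...)` once per routed request, queue the answer for grading when it returns True, and page someone when `should_alert()` flips.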

Cost to test: $0.004 / call.

Sources

  • Anthropic model card: Haiku 4.5 latency and price.
  • Internal A/B: 12k production calls, mixed support traffic.

Written by

Sam Q.

Applied AI engineer. Writes prompt recipes that survive contact with production.

FAQ

What confidence threshold should I start with?

0.8. Strict enough to keep wrong answers from shipping, loose enough that Haiku still closes most easy calls. Tune from there using a sampled grade.

Does this add latency?

On easy calls, no — Haiku is faster than Opus, so you often beat a single Opus call. Escalated calls pay one extra Haiku round-trip, roughly 200ms.

Can I add Sonnet as a middle tier?

Yes. Haiku → Sonnet → Opus is the three-tier variant. Add it only if your escalation rate is high and Sonnet closes most of it cheaper than Opus.
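
The three-tier variant is the same gate applied in a loop. A sketch under assumptions: `call_model(model, prompt)` is a hypothetical client wrapper you supply, `cascade` is an illustrative name, the one-line `template` is a stand-in for the full triage prompt, and `"sonnet-model-id"` is a placeholder because this article does not name a Sonnet model ID.

```python
import json
from typing import Callable, Sequence, Tuple

# "sonnet-model-id" is a PLACEHOLDER; substitute the real ID you use.
TIERS = ("claude-haiku-4-5", "sonnet-model-id", "claude-opus-4-7")


def cascade(user_input: str,
            call_model: Callable[[str, str], str],
            tiers: Sequence[str] = TIERS,
            threshold: float = 0.8,
            template: str = "TRIAGE PROMPT HERE\n\nRequest:\n{{user_input}}") -> Tuple[str, str]:
    """Try each gated tier in order; the last tier answers without a gate."""
    *gated, last = tiers
    prompt = template.replace("{{user_input}}", user_input)
    for model in gated:
        raw = call_model(model, prompt)
        try:
            verdict = json.loads(raw)
            if float(verdict["confidence"]) >= threshold and isinstance(verdict["answer"], str):
                return model, verdict["answer"]
        except (ValueError, KeyError, TypeError):
            pass  # malformed output counts as "not confident"
    # Nothing cleared the gate: replay the original request to the top tier.
    return last, call_model(last, user_input)
```

Each extra tier adds one more triage round-trip to the hardest calls, which is why it only pays off when the middle tier closes a large share of escalations.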