Cheap-then-deep routing: Haiku 4.5 first, Opus 4.7 only when it earns it

Sam Q.

Sam Q.June 18, 20261 min read108 views

Cheap-then-deep routing: Haiku 4.5 first, Opus 4.7 only when it earns it

Route every call to Haiku 4.5 first. Ask it one thing: can you answer this with high confidence? If yes, ship its answer. If no, replay the same request to Opus 4.7. You pay premium tokens only on the hard 10–20%. Net cost drops ~70% with no quality loss on easy traffic.

Updated on June 30, 2026

Two-tier router diagram in amber on white with a small Anthropic glyph, minimalist PromptAttic hero illustration.

On this page

Two-tier routing. Cheap model triages. Expensive model only on escalation. Most traffic is easy; stop paying Opus rates for it.

Run this as a gate before any expensive call:

You are a triage router. Read the user request below.
Answer ONLY with a JSON object: {"confidence": 0-1, "answer": string|null}.

Rules:
- If you can fully answer with high confidence, set answer to your response and confidence &gt;= 0.8.
- If the request is ambiguous, multi-step, or needs deep reasoning, set answer to null and confidence &lt; 0.5.
- Never guess. Low confidence is cheaper than a wrong answer.

Request:
{{user_input}}

Wire it: if confidence >= 0.8, ship answer. Else replay the original request to Opus 4.7. One branch. No orchestration framework, no router model zoo.

Receipt

Triage model: Haiku 4.5 (claude-haiku-4-5)
Escalation model: Opus 4.7 (claude-opus-4-7)
Triage cost: ~320 in + 40 out tokens ≈ $0.0004
Blended cost across mixed traffic: ~$0.004 / call (≈70% cheaper than all-Opus)

Why it works

Easy traffic dominates. 80% of calls never need a frontier model. Haiku closes them for a tenth of a cent.
Confidence is self-reported but calibrated. Forcing a 0–1 score makes the model commit instead of bluffing.
One escalation hop. No embeddings, no classifier to retrain, no second system to keep in sync.

Failure mode

Overconfident Haiku. It occasionally rates a wrong answer 0.9. Sample 1% of shipped answers and grade them; if the false-confident rate climbs, raise the threshold to 0.9.
Escalation storms. A bad input class can push everything to Opus. Alert when the escalation rate crosses 35%.

Cost to test: $0.004 / call.

Routing cuts the per-call token bill. The platform underneath sets the rest. BudgetForge's 30-day pricing audit of 6 AI app builders walks the all-in monthly math before any routing kicks in.

Sources

Anthropic model card: Haiku 4.5 latency and price.
Internal A/B: 12k production calls, mixed support traffic.

#routing #haiku #opus #cost

Back to library

Share

S

Written by

Sam Q.

Applied AI engineer. Writes prompt recipes that survive contact with production.

FAQ

What confidence threshold should I start with?

0.8. Strict enough to keep wrong answers from shipping, loose enough that Haiku still closes most easy calls. Tune from there using a sampled grade.

Does this add latency?

On easy calls, no — Haiku is faster than Opus, so you often beat a single Opus call. Escalated calls pay one extra Haiku round-trip, roughly 200ms.

Can I add Sonnet as a middle tier?

Yes. Haiku → Sonnet → Opus is the three-tier variant. Add it only if your escalation rate is high and Sonnet closes most of it cheaper than Opus.