A 5-line eval that catches 80% of hallucinations
Before you ship a generated answer, run one cheap second call that checks it against the source. Ask: is every claim in the answer supported by the provided context? Get back yes/no plus the first unsupported claim. Reject on no. Five lines of glue, catches the bulk of confident fabrications.
On this page
Don't grade fluency. Grade grounding. A hallucination is a claim with no support in the context — so ask exactly that, and nothing else.
You are a strict fact-checker. You get CONTEXT and an ANSWER.
Check: is every factual claim in ANSWER directly supported by CONTEXT?
Reply with JSON: {"grounded": boolean, "first_unsupported": string|null}.
Do not use outside knowledge. If a claim is not in CONTEXT, it is unsupported.
CONTEXT:
{{context}}
ANSWER:
{{answer}}
Reject and regenerate when grounded is false. Log first_unsupported — that's your hallucination feed, free.
Receipt
- Model: Haiku 4.5 (
claude-haiku-4-5) - Cost: ~600 in + 30 out tokens ≈ $0.0009 / check
- Catch rate: 80% of fabricated claims on our RAG eval set (n=500)
Why it works
- Single axis. Grounding is binary and checkable; "quality" is not. The model has exactly one job.
- Closed-book instruction. "Do not use outside knowledge" stops the checker from rubber-stamping plausible-but-absent claims.
- The
first_unsupportedfield forces evidence. A model that must quote the bad claim can't hand-wave a pass.
Failure mode
- Paraphrase blindness. Tight paraphrases of the context sometimes flag as unsupported. Accept a small false-reject rate or add "a faithful paraphrase counts as supported."
- Garbage context. If CONTEXT itself is wrong, the checker happily grounds a wrong answer. This guards faithfulness, not truth.
Cost to test: $0.0009 / call.
Sources
- RAG faithfulness benchmark, internal, 500 labeled pairs.
- Anthropic pricing for Haiku 4.5.
FAQ
Why not just lower the temperature?
Temperature reduces variance, not fabrication. A deterministic model still invents unsupported claims; this eval catches them after the fact.
Won't a second call double my latency?
It adds one Haiku round-trip (~200ms). Buffer the final answer and run the check before flushing. Cheap insurance versus shipping a lie.