A 5-line eval that catches 80% of hallucinations

Sam Q.

Sam Q.June 18, 20262 min read110 views

A 5-line eval that catches 80% of hallucinations

Before you ship a generated answer, run one cheap second call that checks it against the source. Ask: is every claim in the answer supported by the provided context? Get back yes/no plus the first unsupported claim. Reject on no. Five lines of glue, catches the bulk of confident fabrications.

Updated on June 30, 2026

Yellow magnifying glass over three claim lines with an Anthropic glyph, minimalist white-and-amber PromptAttic hero.

On this page

Don't grade fluency. Grade grounding. A hallucination is a claim with no support in the context, so ask exactly that, and nothing else.

You are a strict fact-checker. You get CONTEXT and an ANSWER.
Check: is every factual claim in ANSWER directly supported by CONTEXT?
Reply with JSON: {"grounded": boolean, "first_unsupported": string|null}.
Do not use outside knowledge. If a claim is not in CONTEXT, it is unsupported.

CONTEXT:
{{context}}

ANSWER:
{{answer}}

Reject and regenerate when grounded is false. Log first_unsupported. That is your hallucination feed, free.

Receipt

Model: Haiku 4.5 (claude-haiku-4-5)
Cost: ~600 in + 30 out tokens ≈ $0.0009 / check
Catch rate: 80% of fabricated claims on our RAG eval set (n=500)

Why it works

Single axis. Grounding is binary and checkable; "quality" is not. The model has exactly one job.
Closed-book instruction. "Do not use outside knowledge" stops the checker from rubber-stamping plausible-but-absent claims.
The first_unsupported field forces evidence. A model that must quote the bad claim can't hand-wave a pass.

Failure mode

Paraphrase blindness. Tight paraphrases of the context sometimes flag as unsupported. Accept a small false-reject rate or add "a faithful paraphrase counts as supported."
Garbage context. If CONTEXT itself is wrong, the checker happily grounds a wrong answer. This guards faithfulness, not truth.

Cost to test: $0.0009 / call.

Field log

For a longer write-up of where wrong-shaped agent output bites in practice, see this DevMoment field log of 7 Claude Code subagents, 4 kept and 3 deleted. The "must include file:line for every claim" prompt addition in that post is the same defense the 5-line eval here automates, applied at the source instead of the sink.

For the bigger picture on what these eval costs become when you compound them across power users inside a paying SaaS, a recent teardown of three pricing ladders for AI features (with COGS math at the median, 90th, and 99th percentile) is worth a read.

Sources

RAG faithfulness benchmark, internal, 500 labeled pairs.
Anthropic pricing for Haiku 4.5.

#eval #hallucination #haiku #guardrails

Back to library

Share

S

Written by

Sam Q.

Applied AI engineer. Writes prompt recipes that survive contact with production.

FAQ

Why not just lower the temperature?

Temperature reduces variance, not fabrication. A deterministic model still invents unsupported claims; this eval catches them after the fact.

Won't a second call double my latency?

It adds one Haiku round-trip (~200ms). Buffer the final answer and run the check before flushing. Cheap insurance versus shipping a lie.