Eval
Sam Q. · 1 min read

A 5-line eval that catches 80% of hallucinations

Before you ship a generated answer, run one cheap second call that checks it against the source. Ask: is every claim in the answer supported by the provided context? Get back a yes/no verdict plus the first unsupported claim. Reject on no. It's five lines of glue, and it catches the bulk of confident fabrications.


Don't grade fluency. Grade grounding. A hallucination is a claim with no support in the context — so ask exactly that, and nothing else.

You are a strict fact-checker. You get CONTEXT and an ANSWER.
Check: is every factual claim in ANSWER directly supported by CONTEXT?
Reply with JSON: {"grounded": boolean, "first_unsupported": string|null}.
Do not use outside knowledge. If a claim is not in CONTEXT, it is unsupported.

CONTEXT:
{{context}}

ANSWER:
{{answer}}

Reject and regenerate when grounded is false. Log first_unsupported — that's your hallucination feed, free.
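
For illustration, the glue might look like the sketch below. It is one possible wiring, not the exact production code: `call_model` and `generate` are hypothetical callables standing in for your checker client (for example a Haiku call) and your answer generator, and treating unparseable checker output as a reject is an assumption of this sketch.

```python
import json
from typing import Callable, Optional

# The checker prompt from above. Literal JSON braces are doubled for str.format.
CHECK_PROMPT = """You are a strict fact-checker. You get CONTEXT and an ANSWER.
Check: is every factual claim in ANSWER directly supported by CONTEXT?
Reply with JSON: {{"grounded": boolean, "first_unsupported": string|null}}.
Do not use outside knowledge. If a claim is not in CONTEXT, it is unsupported.

CONTEXT:
{context}

ANSWER:
{answer}"""


def check_grounding(context: str, answer: str,
                    call_model: Callable[[str], str]) -> dict:
    """Run the checker once and return its parsed verdict.

    call_model is a hypothetical function: prompt text in, reply text out.
    """
    raw = call_model(CHECK_PROMPT.format(context=context, answer=answer))
    try:
        verdict = json.loads(raw)
    except json.JSONDecodeError:
        verdict = None
    if not isinstance(verdict, dict):
        # Assumption: fail closed when the checker reply is not a JSON object.
        return {"grounded": False, "first_unsupported": None}
    return {
        "grounded": verdict.get("grounded") is True,
        "first_unsupported": verdict.get("first_unsupported"),
    }


def answer_with_check(context: str,
                      generate: Callable[[str], str],
                      call_model: Callable[[str], str],
                      max_attempts: int = 3,
                      log: Callable[[Optional[str]], None] = print) -> Optional[str]:
    """Regenerate until the checker passes an answer, or give up with None."""
    for _ in range(max_attempts):
        answer = generate(context)
        verdict = check_grounding(context, answer, call_model)
        if verdict["grounded"]:
            return answer
        log(verdict["first_unsupported"])  # the hallucination feed
    return None
```

The `max_attempts` cap is an addition of this sketch. Without one, a context that cannot support any answer would regenerate forever.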

Receipt

  • Model: Haiku 4.5 (claude-haiku-4-5)
  • Cost: ~600 in + 30 out tokens ≈ $0.0009 / check
  • Catch rate: 80% of fabricated claims on our RAG eval set (n=500)
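
A catch rate like the one above can be reproduced on your own labeled pairs with a few lines. The sketch below assumes each pair is reduced to two booleans, whether the answer was labeled fabricated and whether the checker rejected it. The `score_checker` name and the toy data are made up for illustration and are not the internal eval set.

```python
from typing import Iterable, Tuple


def score_checker(results: Iterable[Tuple[bool, bool]]) -> dict:
    """results: (is_fabricated, was_rejected) per labeled pair."""
    fabricated = caught = faithful = false_rejects = 0
    for is_fabricated, was_rejected in results:
        if is_fabricated:
            fabricated += 1
            caught += was_rejected
        else:
            faithful += 1
            false_rejects += was_rejected
    return {
        # Share of fabricated answers the checker rejected.
        "catch_rate": caught / fabricated if fabricated else 0.0,
        # Share of faithful answers the checker wrongly rejected.
        "false_reject_rate": false_rejects / faithful if faithful else 0.0,
    }


# Toy labels only: 4 fabricated answers (3 rejected), 4 faithful (1 rejected).
toy = [(True, True)] * 3 + [(True, False)] + [(False, False)] * 3 + [(False, True)]
scores = score_checker(toy)
```

Tracking the false-reject rate next to the catch rate shows what the check costs you in discarded good answers.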

Why it works

  • Single axis. Grounding is binary and checkable; "quality" is not. The model has exactly one job.
  • Closed-book instruction. "Do not use outside knowledge" stops the checker from rubber-stamping plausible-but-absent claims.
  • The first_unsupported field forces evidence. A model that must quote the bad claim can't hand-wave a pass.

Failure modes

  • Paraphrase blindness. Tight paraphrases of the context are sometimes flagged as unsupported. Accept a small false-reject rate, or add "a faithful paraphrase counts as supported" to the checker prompt.
  • Garbage context. If CONTEXT itself is wrong, the checker happily grounds a wrong answer. This guards faithfulness, not truth.

Cost to test: $0.0009 / call.

Sources

  • RAG faithfulness benchmark, internal, 500 labeled pairs.
  • Anthropic pricing for Haiku 4.5.

Written by

Sam Q.

Applied AI engineer. Writes prompt recipes that survive contact with production.

FAQ

Why not just lower the temperature?

Temperature reduces variance, not fabrication. A deterministic model still invents unsupported claims; this eval catches them after the fact.

Won't a second call double my latency?

It adds one Haiku round-trip (~200ms). Buffer the final answer and run the check before flushing. Cheap insurance versus shipping a lie.
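
A minimal sketch of that buffering step, assuming the generator yields text chunks and that `is_grounded` wraps the checker call. Both names and the fallback message are hypothetical.

```python
from typing import Callable, Iterable


def gated_answer(chunks: Iterable[str],
                 context: str,
                 is_grounded: Callable[[str, str], bool],
                 fallback: str = "Sorry, I couldn't find that in the sources.") -> str:
    """Buffer the whole stream, check it once, then release it or fall back."""
    answer = "".join(chunks)  # nothing reaches the user until the check has run
    if is_grounded(context, answer):
        return answer
    return fallback
```

The trade-off is that the user sees nothing until generation and the check have both finished, so the check's round-trip lands on top of full generation time, not time to first token.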