Hallucinations and the Evaluation Problem
Why models confidently make things up, what causes it, what reduces it, and how to measure progress on a moving target.
Summary
A language model hallucinates when it produces an output that is fluent, confident, and wrong. The output is usually not random nonsense — it is plausible-looking text that fails to match reality, often in ways that are hard to detect without external verification. Citations that do not exist. Function signatures that do not exist. People who do not exist. Events that did not happen. Numbers that look right but are not.
This is not a bug in the training pipeline that future engineers will fix. It is a structural property of how language models work — they are trained to produce probable continuations of text, and “probable” is not the same as “true.” The right framing is harm-reduction and detection, not elimination. The evaluation problem is the companion: knowing whether the new model version is more reliable than the old, in the dimensions you care about, is genuinely hard and getting harder as capabilities improve.
What’s changing
Three developments have changed the shape of this problem since 2023.
Hallucination rates have come down measurably, but unevenly. Frontier models in 2026 hallucinate substantially less than 2023 models on common benchmarks (TruthfulQA, factuality probes, citation accuracy on retrieval-augmented tasks). The improvement comes from larger and cleaner pretraining data, better post-training (reward signals that penalize confident wrong answers instead of rewarding fluency alone), and inference-time techniques (retrieval, tool use, chain-of-thought verification). The gains are not uniform: common knowledge improves faster than niche knowledge, English faster than other languages, recent facts faster than historical ones.
Retrieval-augmented generation has become the default for fact-sensitive deployment. Rather than trying to fix the model’s parametric knowledge, the dominant pattern is to constrain the model to answer from retrieved documents and refuse or hedge when the documents do not support an answer. This shifts the hallucination problem to retrieval quality and grounding faithfulness, both of which are hard but more tractable than fixing the base model.
Evaluation has lagged capability. Models that score 95% on the public benchmarks still produce wrong-but-confident outputs in production. The benchmarks have saturated; the failure modes have moved to harder-to-evaluate dimensions (multi-step reasoning errors, subtle factual drift in long outputs, confidently wrong code that compiles but does not do what was asked). New benchmarks come out monthly and saturate within a year of release.
Open problems
The structurally hard problems behind hallucination and evaluation are these.
Knowing what you do not know. A reliable system should be able to say “I am not confident about this” or “I do not know.” Current models do this poorly. They are trained to produce a continuation, and the training data contains few examples where “I do not know” is the correct answer. Calibration techniques (giving the model an explicit way to express uncertainty, training on epistemic-uncertainty examples, fine-tuning on refusals) help but do not solve the problem. The root issue is that the model’s internal “confidence,” the probability mass it assigns to a token, is only loosely correlated with whether the content is correct.
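One common way to put a number on the gap between confidence and correctness is expected calibration error (ECE). The sketch below is illustrative only: it assumes you already have one confidence score in [0, 1] per answer and a ground-truth label for whether that answer was correct. How the confidence is obtained (token probabilities, a stated percentage, a separate scorer) is left open, and the function name is hypothetical.

```python
from typing import List, Sequence, Tuple


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """Weighted average gap between mean confidence and accuracy per bin."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    if not confidences:
        raise ValueError("need at least one prediction")

    # Equal-width bins over [0, 1]; a confidence of exactly 1.0 goes in the last bin.
    bins: List[List[Tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        mean_confidence = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        ece += (len(bucket) / total) * abs(mean_confidence - accuracy)
    return ece
```

A model that says 0.9 on four answers and gets two right has an ECE of about 0.4 on that sample. A well-calibrated model would sit near zero.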
Grounding faithfulness. When a model is given retrieved documents and asked to answer, “faithful” means the answer follows from the documents and only from the documents. Real models often blend retrieved content with parametric knowledge, sometimes correctly (filling in unambiguous context) and sometimes incorrectly (hallucinating details not in the documents). Measuring faithfulness automatically is itself an evaluation problem — you need a model to compare the answer to the documents, and that model can be wrong.
Multi-step reasoning failures. Reasoning-trained models do not make hallucination go away; they can hallucinate at intermediate steps in ways that propagate through the rest of the trace. A confident wrong step early in a long chain produces a confident wrong conclusion. Detecting this requires evaluating the reasoning trace, not just the final answer, which most evaluation pipelines do not do.
Evaluation at production scale. Knowing whether your deployed system is regressing is much harder than running benchmark scores. You need an evaluation set that reflects your actual traffic, automatic graders that match human judgement on your domain, and the operational infrastructure to run evals on every model version and every prompt change. Most production teams have some of this; very few have all of it.
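Whether an automatic grader "matches human judgement" can be checked by having both label the same sample and computing chance-corrected agreement. The sketch below uses Cohen's kappa, which is one reasonable choice and not something the text above prescribes. The label values are placeholders.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(human: Sequence[Hashable], grader: Sequence[Hashable]) -> float:
    """Agreement between two label sequences, corrected for chance agreement."""
    if len(human) != len(grader) or not human:
        raise ValueError("need two non-empty label sequences of equal length")

    n = len(human)
    observed = sum(1 for h, g in zip(human, grader) if h == g) / n

    # Chance agreement: probability both raters pick the same label independently.
    human_counts = Counter(human)
    grader_counts = Counter(grader)
    expected = sum(
        human_counts[label] * grader_counts[label] for label in human_counts
    ) / (n * n)

    if expected == 1.0:
        # Both raters used one identical label everywhere; agreement is trivially perfect.
        return 1.0
    return (observed - expected) / (1.0 - expected)
```

Raw percent agreement can look high when almost every example passes. Kappa discounts that, which matters when the failures you care about are rare.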
Adversarial hallucination. Hallucinations are not always organic — they can be elicited. Prompts designed to make the model confidently invent content do so reliably. This becomes a safety issue when the attacker controls part of the model’s input (retrieved documents, tool outputs, user prompts) and uses that control to produce misleading outputs other users will see.
Risks and mitigations
Concrete patterns that reduce hallucination in production, in rough order of leverage.
Retrieval-augmented generation done well. Strong retrieval (dense + sparse + reranker), explicit citation requirements, and prompts that instruct the model to refuse when documents do not support an answer. The “done well” matters — bad retrieval makes the problem worse, not better, by giving the model irrelevant context to anchor its hallucinations on.
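The text does not say how dense and sparse results should be combined. Reciprocal rank fusion (RRF) is one widely used option, sketched here under that assumption. The document IDs are made up, and the constant k = 60 is a conventional default, not a tuned value.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists of document IDs into one ranking.

    Each document scores 1 / (k + rank) in every list where it appears,
    with rank starting at 1; scores are summed across lists.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so the output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results from a dense retriever and a sparse (keyword) retriever.
dense_hits = ["a", "b", "c"]
sparse_hits = ["b", "d", "a"]
fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
```

The fused list would then go to the reranker, and only the top few passages would reach the prompt.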
Structured output and tool use. Constraining the model to output JSON matching a schema, to call functions with typed arguments, or to fetch data from authoritative APIs reduces the surface for free-form fabrication. The model still hallucinates inside the structure (wrong values for the right fields) but the structural framing is enforced.
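As a minimal sketch of what enforcing the structure can look like, the function below parses a model response as JSON and rejects anything that does not match an expected set of fields and types. The invoice schema and function name are hypothetical, and only the standard library is used; a real deployment might rely on a schema library or on the provider's structured-output mode instead.

```python
import json
from typing import Any, Dict

# Hypothetical schema for illustration: field name -> required Python type.
INVOICE_SCHEMA: Dict[str, type] = {"invoice_id": str, "total": float, "currency": str}


def parse_structured_output(raw: str, schema: Dict[str, type]) -> Dict[str, Any]:
    """Parse model output as JSON and enforce exact fields and types."""
    data = json.loads(raw)  # raises json.JSONDecodeError (a ValueError) if malformed
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")

    missing = sorted(schema.keys() - data.keys())
    extra = sorted(data.keys() - schema.keys())
    if missing or extra:
        raise ValueError(f"missing fields {missing}, unexpected fields {extra}")

    for field, expected_type in schema.items():
        value = data[field]
        # bool is a subclass of int in Python, so reject it unless bool is expected.
        if isinstance(value, bool) and expected_type is not bool:
            raise ValueError(f"{field}: expected {expected_type.__name__}, got bool")
        # JSON does not distinguish 12 from 12.0, so accept integers for float fields.
        if expected_type is float and isinstance(value, int):
            value = float(value)
            data[field] = value
        if not isinstance(value, expected_type):
            raise ValueError(
                f"{field}: expected {expected_type.__name__}, got {type(value).__name__}"
            )
    return data
```

This check catches an invented field or a string where a number belongs. It cannot catch a well-formed but wrong total, so value-level checks against an authoritative source are still needed.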
Chain-of-thought with verification. Generating reasoning steps before the final answer reduces some categories of error. Verifying the reasoning against external sources (re-checking citations, recomputing numbers, running the code) catches some of the remaining errors. Inference-time compute spent on verification is one of the best-leveraged costs in a reliability budget.
Refusal training for low-confidence states. Modern frontier models are fine-tuned to express uncertainty or decline to answer when they should. The training is imperfect — models still produce confident answers on things they should hedge on — but it is meaningfully better than 2023.
External grounding for high-stakes outputs. When the cost of a wrong answer is high (medical, legal, financial), the architecture should route through a verification layer: a second model that critiques the first, a deterministic check, or a human reviewer. The pattern that holds up in production is “model generates, verifier checks, low-confidence outputs escalate.”
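The "model generates, verifier checks, low-confidence outputs escalate" pattern can be sketched as a small routing function. Everything named here is an assumption for illustration: the generate and verify callables stand in for whatever model, deterministic check, or critique step you use, and the 0.8 threshold is arbitrary.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Verdict:
    supported: bool    # did the verifier find the answer acceptable?
    confidence: float  # verifier's confidence in its own verdict, in [0, 1]


@dataclass(frozen=True)
class Decision:
    action: str  # "release" or "escalate"
    answer: str
    reason: str


def route(
    question: str,
    generate: Callable[[str], str],
    verify: Callable[[str, str], Verdict],
    min_confidence: float = 0.8,
) -> Decision:
    """Generate an answer, verify it, and escalate anything not clearly supported."""
    answer = generate(question)
    verdict = verify(question, answer)
    if not verdict.supported:
        return Decision("escalate", answer, "verifier rejected the answer")
    if verdict.confidence < min_confidence:
        return Decision("escalate", answer, "verifier confidence below threshold")
    return Decision("release", answer, "verified")
```

Escalation here just returns a decision. In practice that decision would feed a human review queue or a fallback response, and the threshold would be set from measured verifier accuracy on your own eval set.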
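Model-level mitigations (better training, refusal tuning) and system-level mitigations (retrieval, structured output, verification layers) are complementary. Frontier labs invest mostly in the former; product teams invest in the latter; the production stack uses both.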
Evaluation as a first-class engineering practice. The teams that ship reliably have an evaluation harness that runs automatically on every change, an eval set that reflects real traffic, and a process for adding new evals when new failure modes are discovered. The teams that ship unreliably do not. This is the single biggest predictor of production AI quality, and it is the part of the stack that gets the least attention in the public discourse.
What to watch
Things worth tracking specifically in this area through 2026 and 2027.
Hallucination benchmarks that do not saturate. Most current benchmarks are saturated by frontier models. New benchmarks designed with harder-to-game failure modes — long-form factuality, multi-step reasoning correctness, citation faithfulness, adversarially constructed examples — are where the meaningful signal will be. Watch for benchmarks where frontier models score below 70%.
Automatic evaluation methodology. LLM-as-judge is the dominant pattern for scalable automatic evaluation, but it has known biases (length, position, self-preference). The next generation of automatic graders — pairwise comparison, calibrated rubrics, ensemble judging — is an active research area. Whether any of these gets reliable enough for production use is open.
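Position bias in pairwise judging has a simple partial countermeasure: ask the judge twice with the answers in swapped order and accept a preference only if it survives the swap. The sketch below assumes a hypothetical judge callable that returns "first", "second", or "tie". It is not any particular vendor's API.

```python
from typing import Callable

# Hypothetical judge signature: (prompt, first_answer, second_answer) -> verdict,
# where the verdict is "first", "second", or "tie".
Judge = Callable[[str, str, str], str]


def debiased_preference(prompt: str, answer_a: str, answer_b: str, judge: Judge) -> str:
    """Return "a", "b", or "tie", counting a win only if it holds in both orders."""
    forward = judge(prompt, answer_a, answer_b)   # A shown first
    backward = judge(prompt, answer_b, answer_a)  # B shown first

    # Translate positional verdicts back to the underlying answers.
    forward_pick = {"first": "a", "second": "b"}.get(forward, "tie")
    backward_pick = {"first": "b", "second": "a"}.get(backward, "tie")

    if forward_pick == backward_pick:
        return forward_pick
    # The judge flipped when the order flipped, so treat the comparison as inconclusive.
    return "tie"
```

This doubles judging cost and addresses only position bias. Length bias and self-preference need separate controls, such as length-matched comparisons or a judge from a different model family.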
Calibration research. Techniques that produce models whose stated confidence matches their actual reliability. This is harder than it sounds — the model has to know what it does not know, which is a deep architectural question. Watch for progress on “selective prediction” and “epistemic uncertainty.”
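Selective prediction is usually evaluated by asking how accurate the system is if it answers only its most confident fraction of questions and abstains on the rest. The sketch below computes accuracy at a given coverage. It assumes per-answer confidence scores and correctness labels are available, and the function name is hypothetical.

```python
import math
from typing import Sequence


def accuracy_at_coverage(
    confidences: Sequence[float],
    correct: Sequence[bool],
    coverage: float,
) -> float:
    """Accuracy on the most confident `coverage` fraction of predictions."""
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("need two non-empty sequences of equal length")
    if not 0.0 < coverage <= 1.0:
        raise ValueError("coverage must be in (0, 1]")

    # Rank predictions from most to least confident.
    ranked = sorted(zip(confidences, correct), key=lambda pair: pair[0], reverse=True)
    keep = max(1, math.ceil(coverage * len(ranked)))
    answered = ranked[:keep]  # everything below the cut is treated as an abstention
    return sum(1 for _, ok in answered if ok) / keep
```

If confidence carries real signal, accuracy should rise as coverage shrinks. A flat curve means the confidence score is not telling you which answers to trust.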
Grounding-faithfulness measurement. Better tools for measuring whether RAG outputs actually follow from retrieved documents. The current state of the art is improving but still has high false-positive and false-negative rates.
Domain-specific reliability work. Generic reliability is hard; reliability for specific high-value domains (medical, legal, financial, scientific) is where most of the practical investment is going. The patterns developed in those domains will likely propagate back to general use.
A practical heuristic for production deployment
If you are deploying an AI feature where wrong answers have a meaningful cost, build the evaluation harness before you build the feature. Specifically: collect 200 to 500 realistic examples with ground-truth answers, define what “correct” means with explicit grading rubrics, build the automatic grader (LLM-as-judge is fine to start), and run it on a baseline. Only then build the feature, and run the eval on every change. Teams that skip this step ship unreliable systems and discover the problems only in production. The harness is cheaper than the incident response.
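A harness of this kind can start very small. The sketch below shows one possible shape: a list of examples, a pluggable grader, a pass rate, and a gate that fails a change if the score drops below the baseline. All names are hypothetical, the exact-match grader stands in for a rubric-based or LLM-as-judge grader, and the tolerance is an arbitrary placeholder.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Example:
    prompt: str
    expected: str  # ground-truth answer


# Grader signature: (prompt, expected, actual) -> True if the output is acceptable.
Grader = Callable[[str, str, str], bool]


def exact_match_grader(prompt: str, expected: str, actual: str) -> bool:
    """Placeholder grader; swap in a rubric-based or model-based grader later."""
    return expected.strip().lower() == actual.strip().lower()


def run_eval(
    examples: Sequence[Example],
    system: Callable[[str], str],
    grader: Grader,
) -> float:
    """Return the fraction of examples the system under test passes."""
    if not examples:
        raise ValueError("eval set is empty")
    passed = sum(
        1 for ex in examples if grader(ex.prompt, ex.expected, system(ex.prompt))
    )
    return passed / len(examples)


def regression_gate(baseline: float, candidate: float, tolerance: float = 0.01) -> bool:
    """True if the candidate score has not dropped more than `tolerance` below baseline."""
    return candidate >= baseline - tolerance
```

Running run_eval on the current system gives the baseline. Wiring regression_gate into continuous integration is what turns the eval set into a check that runs on every model or prompt change.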