Why AI Models Hallucinate

In September 2025, OpenAI published a paper that said something the AI industry already suspected but hadn’t quite articulated. The paper, “Why Language Models Hallucinate”, authored by Adam Tauman Kalai, Ofir Nachum, Santosh Vempala, and Edwin Zhang, didn’t just catalogue the problem. It pointed the finger at the evaluation systems that are supposed to keep models honest and argued that those systems are actively making hallucination worse.

The paper’s central argument is disarmingly simple. Language models hallucinate because we reward them for guessing. The training loops, the benchmarks, the leaderboards that determine which model gets called “best” all operate on a scoring system that treats confident wrong answers and honest uncertainty as equally worthless. Under those rules, the rational strategy for any model is to always take a shot, even when the evidence is thin. And that strategy produces hallucinations.

Researchers have known for years that models tend toward overconfidence. But the OpenAI paper formalised it with mathematical precision and made an argument that goes further than most. The problem is that our entire evaluation infrastructure systematically incentivises the specific failure mode we claim to care most about fixing.


The Mechanics of Making Things Up

To understand why the paper matters, it helps to start with what hallucination actually is at a mechanical level.

During pretraining, a language model learns to predict the next token in a sequence. It ingests billions of documents and builds a statistical model of what words tend to follow other words in what contexts. This process is extraordinarily powerful for capturing patterns, grammar, reasoning structures, and factual associations. But it has an inherent limitation that no amount of scale can fully overcome.

Some facts appear in training data frequently enough that the model can learn them reliably. The capital of France, the boiling point of water, the year the Berlin Wall fell. These are high-frequency, well-attested facts that leave strong statistical signals. But other facts appear rarely or only once. The title of a specific researcher’s PhD dissertation. The birthday of a mid-career academic. The precise holdings of a niche legal case from 2019. These “singleton” facts leave weak or ambiguous traces in the training distribution, and no model, regardless of size, can learn them with confidence from pattern matching alone.

The OpenAI paper draws an analogy to supervised learning that makes this intuitive. In any classification task, there’s an irreducible error rate determined by the overlap between classes in the training data. Generative models face an equivalent problem, because some questions simply cannot be answered correctly from the training distribution, and the model’s best option in those cases would be to say “I don’t know.” The paper refers to this as the model’s “singleton rate,” the fraction of facts that appeared only once during training and therefore can’t be reliably recalled.

This matters because it puts a hard floor under hallucination rates regardless of model size or architecture. You can make a model bigger, train it on more data, and give it better reasoning capabilities, and you will reduce hallucinations on well-attested facts. But you will never eliminate them on rare facts, because the statistical signal for those facts is too weak to distinguish from noise. The paper is explicit about this point: even a model that is 100% accurate on common facts would still hallucinate on singleton facts, and the only alternative to hallucinating on those facts is abstention.

None of this is mysterious. It’s basic statistics applied to language modelling. But what happens next, in the post-training phase, is where things go wrong in a more avoidable way.

The Test-Taking Incentive Problem

After pretraining, models go through rounds of fine-tuning designed to make them more helpful, less harmful, and better at following instructions. This process involves evaluation on benchmarks, and it’s here that the OpenAI paper identifies the core dysfunction.

The paper’s authors compare modern AI benchmarks to multiple-choice tests where leaving an answer blank guarantees zero points. On such tests, the optimal strategy for a test-taker who doesn’t know the answer is to guess. There’s some chance of being right, and no additional penalty for being wrong. Language model benchmarks work on the same principle, and most prominent evaluations, including MMLU-Pro, GPQA, MATH, and others that dominate public leaderboards, use binary scoring where a correct answer scores one point and everything else, whether wrong or abstained, scores zero.

Under this system, a model that says “I don’t know” to a question it’s uncertain about gets exactly the same score as a model that confidently invents an answer. But the model that guesses will occasionally be right by chance, which pushes its aggregate accuracy higher. Since accuracy is the number that appears on leaderboards, in model cards, and in press releases, the models that guess most aggressively tend to look best.

The paper illustrates this with a concrete example from SimpleQA-style metrics. One model showed an error rate of 75% with only 1% abstentions, meaning it almost never admitted uncertainty and was wrong three-quarters of the time when it did answer. Another model abstained 52% of the time and dramatically reduced its error rate. But on a traditional accuracy-only leaderboard, the difference between these two models would look modest, because the metric that gets reported doesn’t distinguish between “wrong” and “chose not to answer.”

This is not an edge case in how benchmarks work. It’s the dominant paradigm. As the paper puts it, the majority of mainstream evaluations reward hallucinatory behaviour. The proposed fix is almost embarrassingly obvious, and borrowed directly from standardised testing. Introduce negative marking for wrong answers, or give partial credit for appropriate expressions of uncertainty, so that honest non-answers score better than confident mistakes.
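The incentive argument comes down to a few lines of expected-value arithmetic. The sketch below is illustrative rather than taken from the paper, though the `t/(1-t)` penalty in the last example follows the confidence-threshold scoring rule the OpenAI paper discusses, under which the break-even confidence for answering is exactly `t`.

```python
def expected_score(p_correct: float, penalty: float, abstain: bool) -> float:
    """Expected points per question for a model that is right with
    probability p_correct when it answers. Binary scoring is penalty=0;
    negative marking uses penalty > 0. Abstaining always scores zero."""
    if abstain:
        return 0.0
    return p_correct * 1.0 - (1.0 - p_correct) * penalty

# Under binary scoring, any nonzero chance of being right beats abstaining,
# so the rational policy is to always guess:
assert expected_score(0.25, penalty=0.0, abstain=False) > 0.0   # 0.25 > 0

# With negative marking at -1 per wrong answer, guessing at 25% confidence
# becomes a net loss, and the rational policy flips to abstention:
assert expected_score(0.25, penalty=1.0, abstain=False) < 0.0   # -0.50 < 0

# With a penalty of t/(1-t), the break-even confidence is exactly t:
t = 0.75
penalty = t / (1 - t)  # = 3.0
print(expected_score(t, penalty, abstain=False))  # 0.0 at the threshold
```

Nothing about the model changes between the two regimes; only the scoring rule does, and with it the optimal behaviour.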

Looking Inside the Black Box

While OpenAI approached the problem from the evaluation and incentive angle, Anthropic’s interpretability team was working on the same question from the opposite direction, looking at what actually happens inside a model when it decides whether to hallucinate or abstain.

In March 2025, Anthropic published two papers under the banner “Tracing the Thoughts of a Large Language Model” that used a novel “AI microscope” technique to map the computational circuits inside Claude 3.5 Haiku. Among the results was a discovery that runs counter to most people’s intuitions about how hallucination works.

It turns out that Claude’s default behaviour is to refuse to answer. The researchers identified a circuit that is active by default and causes the model to state that it has insufficient information to respond to any given question. This “I don’t know” circuit fires every time Claude receives a query, regardless of the topic. For the model to actually produce an answer, a competing mechanism has to override it. When Claude is asked about something it knows well, a “known entity” feature activates and inhibits the default refusal circuit, allowing the model to respond.

Hallucinations happen when this override misfires. The researchers showed that when Claude recognises a name but doesn’t actually know much about the person, the “known entity” feature can still activate, suppressing the refusal circuit and pushing the model into fabrication mode. By artificially manipulating these circuits in experiments, they could reliably induce hallucinations about fictional people, and by strengthening the refusal circuit, they could prevent them.

This result reframes hallucination as a circuit imbalance rather than a deep-seated flaw. The model already has the machinery to recognise uncertainty and decline to answer. The problem is that this machinery sometimes loses the tug-of-war with the model’s competing drive to produce fluent, helpful-sounding output. And that drive is reinforced by training regimes and evaluations that treat helpfulness as the primary virtue and treat caution as a failure.

The interpretability work and the OpenAI incentives paper are telling the same story from different vantage points. One looks at the external pressures that shape model behaviour and the other looks at the internal mechanisms those pressures create. Both arrive at the same conclusion. Models don’t hallucinate because they’re broken. They hallucinate because the systems we’ve built around them reward confident output and punish honest uncertainty.

Not All Hallucinations Come From the Model

The OpenAI and Anthropic work both locate hallucination inside the model, whether in its training incentives or its internal circuits. But a September 2025 paper in Frontiers in Artificial Intelligence by Anh-Hoang, Tran, and Nguyen adds a third variable that most evaluation frameworks ignore entirely, and that variable is the prompt itself.

The paper introduces formal metrics for separating prompt-induced hallucinations from model-intrinsic ones: three new acronyms that quantify what practitioners already know, which is that bad prompts make bad outputs worse. Conditional Prompt Sensitivity (CPS) measures how much hallucination rates change when you vary the prompt while holding the model constant. Conditional Model Variability (CMV) measures the reverse, how much rates change across models given the same prompt. A third metric, Joint Attribution Score (JAS), captures the interaction effect between the two.

The results are unambiguous. Vague, underspecified prompts dramatically increase hallucination rates in some models but not others. LLaMA 2 showed CPS values of 0.15 under ambiguous prompting, meaning prompt design accounted for a large share of its fabrication behaviour. GPT-4, by contrast, was far less prompt-sensitive (CMV of 0.08), suggesting its hallucinations were more model-intrinsic and less dependent on how the question was framed. Structured prompting techniques like Chain-of-Thought reduced CPS to 0.06 across the board, a meaningful drop that required no model changes at all.
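The Frontiers paper's exact formulas aren't reproduced here, but the idea behind CPS and CMV can be sketched as the spread of hallucination rates along one axis while the other is held fixed. The rates below are hypothetical placeholders chosen only to echo the qualitative pattern described above (a prompt-sensitive model versus a model-intrinsic one).

```python
from statistics import pstdev

# Hypothetical hallucination rates: one row per model, keyed by prompt style.
# These numbers are invented for illustration, not taken from the paper.
rates = {
    "llama2": {"ambiguous": 0.42, "specific": 0.18, "chain_of_thought": 0.14},
    "gpt4":   {"ambiguous": 0.12, "specific": 0.09, "chain_of_thought": 0.07},
}

def cps(model: str) -> float:
    """Spread of hallucination rate across prompts, model held constant."""
    return pstdev(rates[model].values())

def cmv(prompt: str) -> float:
    """Spread of hallucination rate across models, prompt held constant."""
    return pstdev(r[prompt] for r in rates.values())

print(f"CPS(llama2) = {cps('llama2'):.3f}")  # high: prompt-sensitive
print(f"CPS(gpt4)   = {cps('gpt4'):.3f}")    # low: mostly model-intrinsic
print(f"CMV(ambiguous) = {cmv('ambiguous'):.3f}")
```

Whatever the paper's precise estimators, the decomposition is the point: a single hallucination number hides whether the variance lives in the prompt axis, the model axis, or their interaction.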

The practical implication is that hallucination isn’t always a model problem. Sometimes it’s a prompting problem, and sometimes it’s both at once. Models with high JAS scores, like LLaMA 2 under ambiguous prompts (JAS of 0.12), show compounding effects where weak prompts and model limitations multiply each other’s worst tendencies. This means the standard evaluation practice of testing models with fixed prompt templates and attributing all variation to model quality is systematically misleading. Two teams using the same model with different prompt architectures could see wildly different hallucination rates, and neither team’s experience would be wrong.

This reframes the question of responsibility. If a model hallucinates because the prompt was ambiguous, is that a model failure or a deployment failure? Current benchmarks don’t ask this question. They test models under controlled prompting conditions and report a single hallucination rate, flattening a two-dimensional problem into one number. The Frontiers paper suggests that useful evaluation would need to test across a range of prompt qualities, measuring how often a model hallucinates and how sensitive it is to the way questions are asked.

How Evaluation Is Changing (Slowly)

Newer benchmarks are starting to incorporate abstention as a legitimate outcome, but they remain a minority voice in a field still dominated by accuracy-only scoring.

SimpleQA, released by OpenAI in late 2024, treats abstention as a first-class outcome. Each response is graded as correct, incorrect, or not attempted, which makes it possible to measure whether a model knows what it doesn’t know. This is a meaningful step, and the benchmark has been widely cited. But it covers only 4,326 short factual questions with single correct answers, which makes it narrow by design and increasingly saturated. GPT-4o with web search now reaches around 90% accuracy on SimpleQA, and GPT-5 with search and reasoning pushes above 95%, which means the benchmark is approaching its ceiling for models with access to external tools.

HalluLens, presented at ACL 2025, takes a broader approach. It includes multiple task types (short-form QA, long-form generation, and nonexistent entity detection) and explicitly measures both hallucination rates and false refusal rates, the cases where a model declines to answer something it actually knows. This dual measurement is important because it captures a tradeoff that SimpleQA alone misses.

A model that refuses everything would score perfectly on hallucination metrics but be useless in practice. HalluLens found substantial variation across models, with GPT-4o rarely refusing (4.13% false refusal rate) while Llama-3.1-8B-Instruct refused over 83% of the time. Neither extreme is desirable, and having both numbers visible forces a more honest conversation about what good behaviour looks like.

The most ambitious attempt to embed the OpenAI paper’s recommendations into a practical benchmark may be AA-Omniscience, published by Artificial Analysis in November 2025. Its central metric, the Omniscience Index, does exactly what the OpenAI paper prescribed. Correct answers earn +1 point, incorrect answers cost -1 point, and abstentions score zero. This means a model that guesses and gets it wrong is actively penalised relative to a model that admits it doesn’t know. The scale runs from -100 to 100, where zero means a model is correct as often as it is incorrect.
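The scoring rule is simple enough to write down. The +1/-1/0 scheme is as described above; the normalisation onto the -100 to 100 scale is an assumption about how Artificial Analysis aggregates it, consistent with zero meaning equal counts of correct and incorrect answers.

```python
def omniscience_index(correct: int, incorrect: int, abstained: int) -> float:
    """Omniscience-style score: +1 per correct answer, -1 per incorrect
    answer, 0 per abstention, scaled to -100..100 (normalisation assumed)."""
    total = correct + incorrect + abstained
    return 100.0 * (correct - incorrect) / total

# A model that guesses everything and happens to be right 40% of the time:
print(omniscience_index(correct=40, incorrect=60, abstained=0))   # -20.0

# A model with the same knowledge that abstains instead of guessing:
print(omniscience_index(correct=40, incorrect=10, abstained=50))  # 30.0

# Zero means correct exactly as often as incorrect, matching the scale above:
print(omniscience_index(correct=30, incorrect=30, abstained=40))  # 0.0
```

Under accuracy-only scoring the first two models would look identical on their shared knowledge; under this rule the guesser is fifty points worse.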

The results are striking, and somewhat grim. Out of 36 evaluated frontier models, only three scored above zero on the Omniscience Index. Claude 4.1 Opus led with 4.8, followed by GPT-5.1 at 2.0 and Grok 4 at 0.85. Every other model was more likely to hallucinate than to give a correct answer when measured on this basis. Models that look excellent on traditional accuracy benchmarks, including Grok 4 and GPT-5 variants, turned out to have hallucination rates of 64% and 81% respectively when their guessing behaviour was properly penalised.

The most recent entry is HalluHard, published in early 2026, which tackles something the earlier benchmarks mostly ignore. It tests hallucination in multi-turn, open-ended dialogue rather than single-turn factual questions. The reason is that errors compound across turns, and an early hallucination can contaminate the context that the model draws on for subsequent responses, creating a cascading failure that single-turn benchmarks can’t detect. HalluHard found that hallucinations remain substantial even for frontier models with web search access, and that models become progressively more prone to fabrication as conversations grow longer.

One of HalluHard’s more interesting results involves the interaction between reasoning ability and abstention. While more effective reasoning generally reduces hallucination, the effect is model-dependent. GPT-5.2 with reasoning enabled abstains significantly more than its non-reasoning counterpart, especially on niche knowledge questions, suggesting that deeper thinking makes the model more aware of its own knowledge boundaries. But this pattern doesn’t hold universally, and some models show the opposite behaviour, where reasoning makes them more confident rather than more cautious.

The benchmark also confirmed something the OpenAI paper predicted, that models struggle most with niche facts that have some trace in training data rather than with completely fabricated entities. When asked about something entirely made up, models are more likely to recognise it as unfamiliar and refuse to answer. But when asked about something they vaguely recognise without knowing well, they tend to guess, because the partial familiarity triggers the “known entity” response that Anthropic’s circuit analysis identified.

Work at the training level points in a more encouraging direction. A December 2025 paper on behaviourally calibrated reinforcement learning showed that a 4-billion-parameter model trained with proper calibration incentives could match or exceed frontier models on uncertainty quantification, despite being orders of magnitude smaller. The model’s signal-to-noise ratio gain (measuring the ratio of correct answers to hallucinations) substantially beat GPT-5 on challenging mathematical reasoning tasks, suggesting that teaching models when to abstain is a skill that can be learned independently of raw knowledge.

Where Evaluation Still Falls Short

Despite this progress, the structural problems the OpenAI paper identified remain largely intact. There are at least four ways in which the current evaluation system continues to fail.

The leaderboard problem persists. The benchmarks that drive public perception, model selection, and commercial decisions are still overwhelmingly accuracy-only. When a new model launches, the numbers that appear in the announcement blog post are accuracy on MMLU, pass rates on SWE-bench, scores on GPQA Diamond. These are the metrics that journalists report, that enterprise buyers compare, and that engineering teams optimise for. Benchmarks like AA-Omniscience and HalluLens exist but remain niche, and until the headline number on a model card includes a hallucination-penalising metric alongside accuracy, the incentive structure the OpenAI paper described will continue to push models toward confident guessing.

Single-turn factuality is an inadequate proxy for production behaviour. Most hallucination benchmarks test whether a model can correctly answer isolated factual questions. But the failure modes that actually hurt people in deployment are different. They involve subtle distortions in summaries, fabricated citations in legal research, invented details woven into otherwise accurate reports, and cascading errors in multi-turn conversations. HalluHard is a step toward tackling this, but it remains a single benchmark. The gap between “can this model answer trivia correctly” and “will this model produce reliable output in my specific workflow” is enormous, and very few evaluations attempt to bridge it.

Domain-specific hallucination is underexplored. AA-Omniscience shows dramatic variation across domains, with different models leading in different domains. A Stanford study in the Journal of Empirical Legal Studies found that even purpose-built legal AI tools like Westlaw AI produce responses that are not significantly more trustworthy than general-purpose models, with hallucinations that require close analysis of cited sources to detect.

A study in npj Digital Medicine found that GPT-4o hallucinated at a 53% rate on medical questions before targeted mitigation, dropping to 23% with improved prompting. These domain-specific rates are far higher than the aggregated numbers that appear on general leaderboards, and they vary in ways that general-purpose benchmarks don’t capture.

Retrieval-augmented generation doesn’t solve the problem. There’s a widespread assumption that giving models access to external documents through RAG architectures eliminates hallucination risk. The evidence doesn’t support this. Vectara’s hallucination leaderboard, which tests grounded summarisation where models are given source documents and asked to faithfully summarise them, still shows non-trivial inconsistency rates across all models tested.

The model can misread the source, over-generalise from it, or fill gaps between retrieved passages with invented material. RAG reduces the frequency of hallucination, but it changes the type rather than eliminating the problem. And because RAG-augmented models often cite their sources, the hallucinations they do produce carry an extra layer of false authority that makes them harder to catch.

The entire evaluation terrain is English-only and text-only. Nearly every benchmark discussed so far tests English-language factual questions in a text-to-text setting. This is a problem because hallucination rates spike dramatically once you step outside that narrow frame. Mu-SHROOM, a SemEval 2025 shared task that tested hallucination detection across 14 languages, found that hallucination rates and detection difficulty vary enormously by language, with low-resource languages showing far worse outcomes than English. The task attracted 2,618 submissions from 43 teams, a sign of the community’s recognition of this gap, and the results confirmed what many suspected. A model that is well-calibrated in English can be wildly overconfident in Swahili or Basque.

The multimodal picture is no better. CCHall, presented at ACL 2025, tests hallucination when models must reason across both languages and images simultaneously. Even the best-performing model (GPT-4o with a multi-agent debate framework) achieved only 77.5% accuracy, with performance dropping 10.9 points compared to handling cross-modal hallucinations alone.

The benchmark also found that longer model responses trigger substantially higher hallucination rates, with a sharp inflection point around 120 words, after which output reliability degrades significantly. These are not obscure failure modes. If you’re deploying a model to handle customer queries in multiple languages, or building a system that reasons over images and text together, your real-world hallucination rate is almost certainly higher than what any English-only benchmark would predict.

Enterprise evaluation is moving in the right direction but slowly. The Bessemer State of AI 2025 report noted that 2025 and 2026 would mark a turning point where AI evaluations go “private, grounded, and trusted,” with enterprises building domain-specific evaluation frameworks tailored to their own data and risk profiles.

This is encouraging, but it is a shift toward bespoke testing that doesn’t feed back into the public benchmarks that shape model development. If enterprises build better evals internally but the public leaderboards remain accuracy-only, the models themselves will continue to be optimised for the wrong thing. The fix needs to happen upstream, in the benchmarks that model developers train against, rather than downstream in the evaluations that buyers run after deployment.

The External Pressure Nobody Planned For

The discussion so far has framed hallucination as an internal industry problem, something the AI field needs to solve through better benchmarks and training practices. But the pressure to fix it is increasingly coming from outside the field entirely.

In June 2023, a New York federal judge sanctioned two lawyers and fined them $5,000 for submitting a brief containing fabricated case citations generated by ChatGPT. The Mata v. Avianca case became the first widely reported instance of AI hallucinations entering the legal system, and it set off a chain reaction. One of the lawyers testified that he was “operating under the false perception that [ChatGPT] could not possibly be fabricating cases on its own.” By mid-2025, courts across the country had moved well beyond fines.

In Johnson v. Dunn (July 2025), a Northern District of Alabama judge declared that monetary sanctions were proving ineffective at deterring AI-generated errors and instead disqualified the offending attorneys from the case entirely. Multiple courts now require attorneys to certify that AI-assisted filings have been manually verified.

The problem extends well beyond law firms. In January 2026, GPTZero scanned all 4,841 papers accepted by NeurIPS 2025, the world's most prestigious machine learning conference, and found over 100 confirmed hallucinated citations spread across 51 papers. These included fabricated authors, invented paper titles, and fake DOIs, all of which survived review by three or more expert peer reviewers.

Some were obvious (author names like “John Doe and Jane Smith”), but others were sophisticated blends of real papers with modified titles and expanded author initials. The irony is hard to miss. The leading AI researchers in the world were fooled by the exact failure mode their field is supposed to be studying.

GPTZero had previously found 50 hallucinated citations in papers under review at ICLR 2026, and a separate analysis found that fabricated citations had appeared in US government reports requiring corrections, and in consulting outputs that triggered $98,000 (AUD) refunds.

The pattern is consistent. Hallucinated content doesn’t stop at degrading individual conversations. It enters the official record, whether that’s case law, academic literature, or policy documents, and from there it compounds. Those NeurIPS papers with fake citations will themselves become training data for next-generation models, creating what one researcher called a “self-reinforcing hallucination loop.”

These consequences are materialising faster than the evaluation frameworks are improving. Courts, publishers, and regulators aren’t waiting for the AI field to solve its benchmark problems. They’re imposing external accountability in the form of sanctions and regulatory mandates.

This may end up being the most effective forcing function for better hallucination measurement, not because the field decided to measure the right things, but because the cost of measuring the wrong things became impossible to ignore.

The Collective Action Problem

The deepest issue the OpenAI paper surfaces is structural rather than technical. No individual lab has a strong incentive to score worse on existing benchmarks by making their model more cautious, even if they agree that the benchmarks are measuring the wrong thing. If Lab A trains its model to say “I don’t know” more often and Lab B doesn’t, Lab B’s model will look better on the accuracy-only leaderboards that dominate public comparison. Lab A’s model might be more reliable in practice, but that advantage is invisible to the metrics that drive adoption.

This is a textbook coordination problem. Everyone would benefit from better benchmarks, but nobody wants to be the first to optimise for them at the expense of looking worse on the old ones. The OpenAI paper acknowledges this by framing the solution as “socio-technical,” requiring both a better evaluation and broad adoption of it across the field.

There are signs of movement, though. An August 2025 joint safety evaluation by OpenAI and Anthropic showed the two leading labs converging on “Safe Completions” training that incorporates calibrated uncertainty into model behaviour. Artificial Analysis has folded the Omniscience Index into its Intelligence Index alongside traditional metrics. And newer benchmarks like HalluLens and HalluHard are gaining citations and attention in the research community.

But these are early moves. The central question, whether the field can shift from treating accuracy as the headline metric to treating reliability (accuracy minus hallucination, weighted by abstention) as the headline metric, remains open. Until that shift happens at the level of public leaderboards and model marketing, the incentive structure that produces hallucination will persist even as the models themselves become more capable of avoiding it.

What This Means in Practice

If you’re building with language models today, the practical takeaway from all of this is that you can’t trust aggregate benchmark numbers to tell you how a model will behave in your specific use case. A model that scores 90% on a general factuality benchmark might hallucinate at 50%+ rates in your domain, and you won’t know until you test it on your own data with evaluation criteria that penalise fabrication.

The research points toward a few concrete steps that are worth spelling out. First, when evaluating models for knowledge-intensive tasks, look at metrics that separate accuracy from hallucination rate and include abstention behaviour. The Omniscience Index and SimpleQA’s three-way grading (correct, incorrect, not attempted) provide better signals than raw accuracy alone.

Second, don’t assume that RAG eliminates the problem. Test your retrieval system with adversarial queries, and check whether the model fabricates answers when retrieved context is incomplete or ambiguous.

Third, consider domain-specific evaluation, because a model that does well at coding benchmarks may struggle with legal or medical factuality, and general leaderboards won’t tell you that.

Fourth, pay attention to how a model behaves under uncertainty. If it never says “I don’t know” in your testing, that’s a red flag rather than a strength. The AA-Omniscience results showed that models with the highest accuracy often had the worst reliability scores, precisely because they never abstained.

It’s also worth noting that the gap between public benchmarks and production behaviour creates an information asymmetry that benefits model providers at the expense of buyers. A model card that reports 95% accuracy on a factuality benchmark sounds impressive until you learn that the same model hallucinates 60%+ of the time when it encounters questions outside its confident knowledge range. The metrics that count for your use case, things like “how often does this model fabricate a citation” or “what percentage of its medical advice is unsupported by evidence,” are almost never reported in public evaluations. Building your own eval suite, however tedious, remains the only reliable way to understand what a model will actually do with your data.
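A private eval suite along these lines needn't be elaborate. The sketch below shows the skeleton: SimpleQA-style three-way grading with accuracy, hallucination rate, and abstention rate reported separately. `ask_model` is a stand-in for your real API call, and the string-matching grader is a deliberately naive placeholder for whatever judging you actually trust.

```python
# Minimal sketch of a domain eval harness with three-way grading.
# The grader and the stand-in model below are illustrative only.

def grade(answer: str, gold: str) -> str:
    """Grade a response as 'correct', 'incorrect', or 'not_attempted'."""
    normalized = answer.strip().lower()
    if normalized in {"", "i don't know", "unsure"}:
        return "not_attempted"
    return "correct" if gold.lower() in normalized else "incorrect"

def evaluate(ask_model, dataset):
    """Keep fabrication and honest uncertainty as separate numbers,
    rather than folding both into a single accuracy figure."""
    counts = {"correct": 0, "incorrect": 0, "not_attempted": 0}
    for question, gold in dataset:
        counts[grade(ask_model(question), gold)] += 1
    n = len(dataset)
    return {
        "accuracy": counts["correct"] / n,
        "hallucination_rate": counts["incorrect"] / n,
        "abstention_rate": counts["not_attempted"] / n,
    }

# Toy run with a canned stand-in model:
def fake_model(question):
    canned = {
        "capital of france?": "Paris",
        "boiling point of water?": "I don't know",
        "year the berlin wall fell?": "1961",  # confident fabrication
    }
    return canned[question]

dataset = [
    ("capital of france?", "paris"),
    ("boiling point of water?", "100"),
    ("year the berlin wall fell?", "1989"),
]
print(evaluate(fake_model, dataset))
```

Even this toy version surfaces the distinction the leaderboards flatten: the fabricated Berlin Wall date and the honest "I don't know" land in different buckets instead of both counting as "not correct."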

The OpenAI paper ends with a note that bears repeating. Even a perfectly calibrated model will still produce some hallucinations, because some questions are genuinely unanswerable from any finite training set. The goal isn’t zero hallucinations. It’s a system that knows what it knows, admits what it doesn’t, and is evaluated by metrics that reward exactly that behaviour. We’re not there yet, and the gap between where we are and where we need to be is not mainly a gap in model ability. It’s a gap in how we measure and reward model behaviour. The models are increasingly capable of being honest about their uncertainty. The question is whether we’ll let them.

I am a partner in Better than Good. We help companies make sense of technology and build lasting improvements to their operations. Talk to us today: https://betterthangood.xyz/#contact