In a Rare Paper, OpenAI Pinpoints the Culprit Behind AI Hallucinations

What is AI's most notorious bug? Not crashing code, but "hallucination": the model confidently fabricates facts, leaving you unable to tell truth from falsehood. This fundamental challenge is a key obstacle preventing us from fully trusting AI.

That large models hallucinate has become almost common knowledge, making every serious user cautious. OpenAI itself has pointed out: "ChatGPT can also hallucinate. GPT-5 hallucinates noticeably less, especially when reasoning, but hallucinations still occur. Hallucination remains a fundamental challenge faced by all large language models."

Although academia has proposed a variety of methods to reduce hallucination, there is still no effective remedy that completely "cures" it.

So, why do large models actually experience hallucinations? Today, OpenAI published a rare paper that systematically reveals the root causes of these hallucinations.

First, the definition. OpenAI offers a simple one: a hallucination is "a situation where the model confidently generates a false answer."

As for the cause, in short: standard training and evaluation procedures reward guessing over having the courage to admit uncertainty.

  • Paper Title: Why Language Models Hallucinate
  • Paper Address:

Let's take a closer look at what OpenAI has actually discovered.

What is a hallucination?

Hallucinations are statements generated by language models that appear reasonable but are actually incorrect.

Hallucinations can arise in surprising ways, even for seemingly simple questions. OpenAI gives an example: when asked for the title of the doctoral dissertation of Adam Tauman Kalai (the paper's first author), several widely used chatbots confidently gave three different answers, none of them correct.

When asked for his birthday, a chatbot likewise gave three different dates, all of them wrong.

Teaching to the test

OpenAI stated that hallucinations persist in part because the current evaluation methods set up the wrong incentives. While the evaluations themselves do not directly cause hallucinations, the way most evaluations measure model performance tends to encourage models to guess rather than honestly confront uncertainty.

You can think of it as a multiple-choice test. If you don't know the answer but guess randomly, you might get lucky and guess correctly. Leaving it blank will definitely result in a score of zero. Similarly, when models are scored solely based on accuracy (i.e., the percentage of questions answered correctly), they are encouraged to guess rather than admit "I don't know."

For another example, suppose a language model is asked for someone's birthday that it doesn't know. If it guesses "September 10th," it has a 1-in-365 chance of being right; saying "I don't know" is guaranteed to score zero. Over thousands of test questions, the guessing model ends up looking better on the leaderboard than the cautious model that admits uncertainty.
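To make the incentive concrete, here is a minimal sketch (illustrative only; the scoring rule and numbers are assumptions, not from the paper) comparing the expected leaderboard score of an always-guess model against an always-abstain model under accuracy-only grading:

```python
# Accuracy-only grading: 1 point for a correct answer, 0 otherwise.
# Assumption: every question is a "birthday-style" fact the model does not know,
# so a blind guess is right with probability 1/365.
P_LUCKY_GUESS = 1 / 365
NUM_QUESTIONS = 1000  # size of a hypothetical benchmark

def expected_accuracy_score(p_correct: float, abstains: bool) -> float:
    """Expected per-question score when only accuracy is counted."""
    return 0.0 if abstains else p_correct  # "I don't know" earns nothing

guessing_total = NUM_QUESTIONS * expected_accuracy_score(P_LUCKY_GUESS, abstains=False)
abstaining_total = NUM_QUESTIONS * expected_accuracy_score(P_LUCKY_GUESS, abstains=True)

print(f"always-guess model:   {guessing_total:.2f} expected points")    # ~2.74
print(f"always-abstain model: {abstaining_total:.2f} expected points")  # 0.00
```

Under accuracy-only scoring, the guesser always comes out ahead on the leaderboard, even though essentially all of its answers are wrong.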

For questions with a single "correct answer," there are three kinds of responses: accurate answers, errors, and abstentions, where the model declines to hazard a guess.

OpenAI states that abstaining is part of humility, and humility is one of OpenAI's core values.

Most scoring schemes rank models by accuracy alone, yet an incorrect answer is worse than an abstention. OpenAI's model guidelines state that expressing uncertainty or asking for clarification is preferable to confidently providing information that may be wrong.

Take the SimpleQA evaluation in the GPT-5 system card as an example.

In terms of accuracy, the earlier OpenAI o4-mini model performs slightly better. However, its error rate (i.e., hallucination rate) is significantly higher. Making strategic guesses in uncertain situations can improve accuracy but also increases errors and hallucinations.
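To be explicit about what is being measured, here is a small sketch of how accuracy, error rate, and abstention rate relate; the graded labels and their proportions below are hypothetical, not the system-card figures:

```python
from collections import Counter

# Hypothetical grader output for one model on a SimpleQA-style eval:
# each answer is judged "correct", "incorrect", or "abstain".
grades = ["correct"] * 20 + ["incorrect"] * 30 + ["abstain"] * 50

counts = Counter(grades)
total = len(grades)

accuracy = counts["correct"] / total         # share answered correctly
error_rate = counts["incorrect"] / total     # confident wrong answers, i.e. hallucinations
abstention_rate = counts["abstain"] / total  # share where the model declined to answer

print(f"accuracy={accuracy:.0%}  errors={error_rate:.0%}  abstentions={abstention_rate:.0%}")
# Converting abstentions into guesses nudges accuracy up a little,
# but drives the error rate up much faster.
```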

Leaderboards average results across dozens of evaluations, and most benchmarks boil performance down to a single accuracy number, which creates a false dichotomy between right and wrong.

On simple evaluations like SimpleQA, some models approach 100% accuracy and thereby eliminate hallucinations. But on more challenging evaluations and in real-world use, accuracy is capped below 100%, because the answers to some questions simply cannot be determined, whether due to unavailable information, the limited reasoning capacity of smaller models, or ambiguities that require clarification.

Nevertheless, accuracy-only metrics still dominate leaderboards and model cards, which encourages developers to build models that guess rather than hold back.

This is precisely why models continue to hallucinate even as they become more capable: they learn to confidently give wrong answers rather than admit uncertainty.

A better evaluation method

In response, OpenAI pointed out a simple fix: penalize confident errors more heavily than expressions of uncertainty, and give partial credit for appropriately expressing uncertainty.

This idea is not new. Some standardized tests have long used negative scoring for incorrect answers or partial credit for unanswered questions to discourage blind guessing. Some research teams have also explored assessment methods that take uncertainty and calibration into account.
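A minimal sketch of such a scoring rule follows; the specific weights are illustrative assumptions, not values proposed by OpenAI. The point is that once confident errors cost more than correct answers earn, blind guessing has negative expected value:

```python
# Negative-marking scorer: confident errors are penalized,
# appropriate abstention gets partial credit. Weights are assumptions.
CORRECT_POINTS = 1.0
WRONG_PENALTY = -2.0    # a confident error costs more than a correct answer gains
ABSTAIN_CREDIT = 0.3    # partial credit for honestly saying "I don't know"

def score(status: str) -> float:
    return {"correct": CORRECT_POINTS,
            "incorrect": WRONG_PENALTY,
            "abstain": ABSTAIN_CREDIT}[status]

def expected_score_of_guessing(p_correct: float) -> float:
    """Expected score for a guess that is right with probability p_correct."""
    return p_correct * CORRECT_POINTS + (1 - p_correct) * WRONG_PENALTY

# For the birthday example (p = 1/365), guessing is now a losing strategy:
print(f"guess:   {expected_score_of_guessing(1 / 365):.2f}")  # ~ -1.99
print(f"abstain: {score('abstain'):.2f}")                      # 0.30
```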

But OpenAI stated that simply adding a few new uncertainty-aware tests is not enough. The widely used accuracy-based evaluations need to be updated so that their scoring discourages guessing.

If the main evaluation metrics keep rewarding lucky guesses, models will keep learning to guess. Fixing the metrics would broaden the adoption of hallucination-reduction techniques, both newly developed and previously researched.

How hallucinations arise from next-word prediction

We have discussed why hallucinations are so hard to shake off, but where do these highly specific factual errors actually come from?

After all, large pre-trained models rarely make other types of errors, such as spelling mistakes and mismatched parentheses.

OpenAI stated that the distinction must lie in the patterns present in the data.

Language models first learn through pre-training, which is a process of predicting the next word in massive amounts of text.

Unlike traditional machine learning problems, there is no "true/false" label attached to each statement. The model sees only positive examples of fluent language and must approximate the overall distribution.

When there are no examples labeled as invalid, it becomes more difficult to distinguish between valid and invalid statements. But even with labels, some errors are unavoidable.

To understand the reason, one can consider a simpler analogy. In image recognition, if millions of cat and dog photos are labeled as "cat" or "dog", the algorithm can learn to reliably classify them. But imagine if each pet photo were labeled with the pet's birthday. Since birthdays are essentially random, no matter how advanced the algorithm is, this task will always produce errors.

The same principles apply to pre-training. Spelling and parentheses follow a consistent pattern, so these errors will disappear as the scale expands. However, arbitrary low-frequency facts like a pet's birthday cannot be predicted solely based on patterns, leading to hallucinations.
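A tiny simulation (purely illustrative, not from the paper) makes the contrast concrete: a label that follows a consistent rule can be predicted with essentially no error, while an essentially random label, like a pet's birthday, carries an irreducible error rate no matter how much data the learner sees:

```python
import random
from collections import Counter

random.seed(0)
N = 100_000

# "Spelling-like" task: the label is a deterministic function of the input,
# so a learner that recovers the rule makes no mistakes.
inputs = [random.randint(0, 9) for _ in range(N)]
pattern_labels = [x % 2 for x in inputs]
pattern_preds = [x % 2 for x in inputs]  # the learned rule
pattern_error = sum(p != y for p, y in zip(pattern_preds, pattern_labels)) / N

# "Birthday-like" task: the label is random per example, so even the best
# fixed prediction (the most common day) is wrong almost every time.
birthday_labels = [random.randint(1, 365) for _ in range(N)]
best_guess, _ = Counter(birthday_labels).most_common(1)[0]
birthday_error = sum(best_guess != y for y in birthday_labels) / N

print(f"pattern task error:  {pattern_error:.2%}")    # 0.00%
print(f"birthday task error: {birthday_error:.2%}")   # ~99.7%
```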

OpenAI's analysis explains what types of hallucinations can be produced by next-word prediction. Ideally, the subsequent phase after pre-training should be able to eliminate these hallucinations, but this has not been fully achieved due to the reasons described in the previous section.

Summary

OpenAI stated: "We hope that the statistical perspective in this article can clarify the nature of hallucinations and dispel some common misconceptions."

Claim: hallucinations can be eliminated by improving accuracy, because a model with 100% accuracy never hallucinates.

Finding: accuracy will never reach 100%, because regardless of model size, search, and reasoning capabilities, some real-world questions are inherently unanswerable.

Claim: hallucinations are inevitable.

Finding: they are not, because language models can choose to abstain when uncertain.

Claim: avoiding hallucinations requires a level of intelligence that only large models can achieve.

Finding: it can be easier for a small model to know its own limits. For example, when asked a question in Māori, a small model that knows no Māori can simply answer "I don't know," whereas a model that knows some Māori has to judge how confident it is. As the paper discusses, the computation required for "calibration" is far less than that required for accuracy (see the sketch after this list).

Claim: hallucination is a mysterious flaw of modern language models.

Finding: we can understand the statistical mechanisms by which hallucinations are generated and rewarded in evaluations.

Claim: to measure hallucinations, all we need is a good hallucination eval.

Finding: researchers have published hallucination evals, but a good hallucination eval has little effect against hundreds of traditional accuracy-based evals that penalize humility and reward guessing. Instead, all the major evaluation metrics need to be redesigned to reward expressions of uncertainty.
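As a concrete illustration of "choosing not to respond when uncertain," here is a minimal sketch of confidence-gated abstention. It is an assumed wiring, not OpenAI's implementation: where the confidence estimate comes from (token probabilities, self-evaluation, etc.) and the threshold value are both placeholders:

```python
from dataclasses import dataclass

CONFIDENCE_THRESHOLD = 0.75  # assumed cutoff; in practice this would be tuned

@dataclass
class ModelOutput:
    answer: str
    confidence: float  # model's estimated probability that the answer is correct

def respond(output: ModelOutput) -> str:
    """Return the answer only when the model is confident enough; otherwise abstain."""
    if output.confidence >= CONFIDENCE_THRESHOLD:
        return output.answer
    return "I don't know."

print(respond(ModelOutput(answer="September 10th", confidence=0.02)))  # -> "I don't know."
print(respond(ModelOutput(answer="Wellington", confidence=0.93)))      # -> "Wellington"
```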

OpenAI stated: "Our latest models hallucinate less, and we will keep working to further reduce the rate of confident errors in language model outputs."

Incidentally, according to TechCrunch, OpenAI is restructuring its Model Behavior team, a small but influential group of researchers who shape how the company's AI models interact with people. The team will now report to Max Schwarzer, who leads post-training at OpenAI.

The team’s founder and leader, Joanne Jang, will launch a new project at the company called oai Labs. According to her tweet: "This is a research-oriented team focused on inventing and designing new interface prototypes for people to collaborate with AI."
