AI Hallucination Rate: How to Measure and Reduce It
A practical guide to measuring LLM hallucination rate - what hallucination is, how to build an evaluation set, which metrics to use, and how to reduce hallucination in production.
Hallucination occurs when a language model generates content that is factually incorrect, ungrounded, or fabricated, and presents it with apparent confidence. It is the most visible failure mode of LLMs and one of the most common reasons enterprise customers reject AI products.
What Counts as a Hallucination?
There are three distinct types of hallucination that require different measurement approaches:
Factual hallucination - The model states something factually incorrect. These are verifiable against ground truth.
Grounding hallucination (in RAG systems) - The model generates a claim that is not supported by the retrieved context it was given. The model “knows” something that wasn’t in its input - a particularly dangerous failure mode for knowledge-grounded AI products.
Citation hallucination - The model invents plausible-sounding but non-existent sources. The Mata v. Avianca case, where fabricated case citations were submitted to federal court, is the canonical example.
How to Build a Hallucination Evaluation Set
A useful hallucination evaluation set has three properties:
Domain-specific. Generic benchmarks tell you about a model’s general hallucination tendency, not its hallucination rate on your use case.
Ground-truth verified. Every question in the set has a verified, unambiguous correct answer - verified by a domain expert, not by the model being evaluated.
Adversarial coverage. Include questions designed to elicit hallucination: questions near but outside the model’s training distribution, questions with plausible-sounding false premises, and questions requiring precise numerical or citation recall.
A practical evaluation set size is 100–200 questions - large enough to give a stable rate estimate, small enough for domain experts to verify every answer.
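As a concrete illustration, here is one way an evaluation-set entry might be stored, one JSONL record per question. The field names and example values below are hypothetical, not a standard schema - adapt them to your own domain and review workflow.

```python
# Sketch of a single hallucination evaluation-set entry, stored as JSONL.
# Field names (id, question, ground_truth, source, category) are illustrative only.
import json

example_entry = {
    "id": "support-042",
    "question": "What year was the company founded?",
    "ground_truth": "1998",  # verified by a domain expert, not by the model under test
    "source": "Company fact sheet, reviewed 2024-05",
    "category": "adversarial-numeric",  # e.g. factual, false-premise, adversarial-numeric
}

# Append the entry to the evaluation set file.
with open("hallucination_eval_set.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(example_entry, ensure_ascii=False) + "\n")
```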
Measuring Hallucination Rate
The standard metric is the hallucination rate: the proportion of responses containing at least one hallucinated claim.
For RAG systems, also measure grounding rate: the proportion of claims in each response that are supported by the retrieved context.
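Both metrics are straightforward to compute once each response has been graded, whether by a domain expert or an LLM-as-judge. The sketch below assumes each graded response is a simple dict with hypothetical field names; it is an illustration of the arithmetic, not a grading pipeline.

```python
# Hallucination rate and grounding rate over a set of graded responses.
# The per-response fields (hallucinated_claims, supported_claims, total_claims)
# are assumed to come from an upstream grading step.

def hallucination_rate(graded_responses: list[dict]) -> float:
    """Proportion of responses containing at least one hallucinated claim."""
    flagged = sum(1 for r in graded_responses if r["hallucinated_claims"] > 0)
    return flagged / len(graded_responses)

def grounding_rate(graded_responses: list[dict]) -> float:
    """Proportion of all extracted claims that are supported by the retrieved context."""
    supported = sum(r["supported_claims"] for r in graded_responses)
    total = sum(r["total_claims"] for r in graded_responses)
    return supported / total if total else 0.0

# Example with three graded responses:
graded = [
    {"hallucinated_claims": 0, "supported_claims": 4, "total_claims": 4},
    {"hallucinated_claims": 1, "supported_claims": 2, "total_claims": 3},
    {"hallucinated_claims": 0, "supported_claims": 5, "total_claims": 5},
]
print(hallucination_rate(graded))  # 0.33 - one of three responses hallucinated
print(grounding_rate(graded))      # 0.92 - 11 of 12 claims grounded in context
```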
What Is an Acceptable Hallucination Rate?
| Use case | Acceptable hallucination rate |
|---|---|
| Medical diagnosis support | <0.5% |
| Legal research | <1% |
| Financial advice | <2% |
| General customer support | <5% |
Enterprise B2B procurement teams increasingly specify a maximum hallucination rate. 5% is a common threshold; regulated industries may require below 2%.
How to Reduce Hallucination Rate
The most effective interventions, ranked by impact:
- RAG over verified facts. Retrieve ground truth from a verified knowledge base rather than relying on the model’s parametric knowledge.
- Constrained outputs and explicit uncertainty. Prompt the model to answer only from the provided context and to state uncertainty explicitly rather than generating a plausible-sounding answer.
- Temperature reduction. Lower temperature reduces creative generation and hallucinatory behaviour.
- Fine-tuning on domain data. For narrow domains, fine-tuning on high-quality domain data significantly reduces out-of-distribution hallucination.
- Output validation. Post-process model outputs to verify factual claims against a knowledge base.
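As a rough illustration of output validation, the sketch below flags response sentences that share little vocabulary with the retrieved context. A production validator would typically use an NLI model or an LLM-as-judge rather than this token-overlap heuristic; the function name, threshold, and example strings are assumptions for illustration.

```python
# Simplified output validation: flag response sentences with low lexical overlap
# against the retrieved context. Illustrative heuristic only - not a substitute
# for NLI- or judge-based claim verification.
import re

def unsupported_sentences(response: str, context: str, min_overlap: float = 0.5) -> list[str]:
    """Return response sentences whose content words are mostly absent from the context."""
    context_words = set(re.findall(r"\w+", context.lower()))
    flagged = []
    for sentence in re.split(r"(?<=[.!?])\s+", response.strip()):
        words = [w for w in re.findall(r"\w+", sentence.lower()) if len(w) > 3]
        if not words:
            continue
        overlap = sum(w in context_words for w in words) / len(words)
        if overlap < min_overlap:
            flagged.append(sentence)
    return flagged

context = "The refund window is 30 days from the date of delivery."
response = ("You can request a refund within 30 days of delivery. "
            "Refunds are paid in store credit only.")
print(unsupported_sentences(response, context))
# -> ['Refunds are paid in store credit only.']
```

Flagged sentences can then be routed to a stronger verifier, removed, or replaced with an explicit uncertainty statement before the response reaches the user.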
Book a free AI QA scope call to benchmark your LLM’s hallucination rate and get a prioritised remediation plan.
Ship AI You Can Trust.
Book a free 30-minute AI QA scope call with our experts. We review your model, data pipeline, or AI product - and show you exactly what to test before you ship.
Talk to an Expert