AI Hallucination Rate: How to Measure and Reduce It
A practical guide to measuring LLM hallucination rate - what hallucination is, how to build an evaluation set, which metrics to use, and how to reduce hallucination in production.
Hallucination occurs when a language model generates content that is factually incorrect, ungrounded, or fabricated, and presents it with apparent confidence. It is the most visible failure mode of LLMs and one of the most common reasons enterprise customers reject AI products.
What Counts as a Hallucination?
There are three distinct types of hallucination that require different measurement approaches:
Factual hallucination - The model states something factually incorrect. These are verifiable against ground truth.
Grounding hallucination (in RAG systems) - The model generates a claim that is not supported by the retrieved context it was given. The model “knows” something that wasn’t in its input - a particularly dangerous failure mode for knowledge-grounded AI products.
Citation hallucination - The model invents plausible-sounding but non-existent sources. The Mata v. Avianca case, where fabricated case citations were submitted to federal court, is the canonical example.
How to Build a Hallucination Evaluation Set
A useful hallucination evaluation set has three properties:
Domain-specific. Generic benchmarks tell you about a model’s general hallucination tendency, not its hallucination rate on your use case.
Ground-truth verified. Every question in the set has a verified, unambiguous correct answer - verified by a domain expert, not by the model being evaluated.
Adversarial coverage. Include questions designed to elicit hallucination: questions near but outside the model’s training distribution, questions with plausible-sounding false premises, and questions requiring precise numerical or citation recall.
A practical evaluation set size is 100–200 questions - large enough to give a stable rate estimate, small enough for domain experts to verify every answer.
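As a concrete illustration, here is one way an evaluation-set entry might be stored, one JSONL record per question. The field names and example values below are hypothetical, not a standard schema - adapt them to your own domain and review workflow.

```python
# Sketch of a single hallucination evaluation-set entry, stored as JSONL.
# Field names (id, question, ground_truth, source, category) are illustrative only.
import json

example_entry = {
    "id": "support-042",
    "question": "What year was the company founded?",
    "ground_truth": "1998",  # verified by a domain expert, not by the model under test
    "source": "Company fact sheet, reviewed 2024-05",
    "category": "adversarial-numeric",  # e.g. factual, false-premise, adversarial-numeric
}

# Append the entry to the evaluation set file.
with open("hallucination_eval_set.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(example_entry, ensure_ascii=False) + "\n")
```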
Measuring Hallucination Rate
The standard metric is the hallucination rate: the proportion of responses containing at least one hallucinated claim.
For RAG systems, also measure grounding rate: the proportion of claims in each response that are supported by the retrieved context.
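Both metrics are straightforward to compute once each response has been graded, whether by a domain expert or an LLM-as-judge. The sketch below assumes each graded response is a simple dict with hypothetical field names; it is an illustration of the arithmetic, not a grading pipeline.

```python
# Hallucination rate and grounding rate over a set of graded responses.
# The per-response fields (hallucinated_claims, supported_claims, total_claims)
# are assumed to come from an upstream grading step.

def hallucination_rate(graded_responses: list[dict]) -> float:
    """Proportion of responses containing at least one hallucinated claim."""
    flagged = sum(1 for r in graded_responses if r["hallucinated_claims"] > 0)
    return flagged / len(graded_responses)

def grounding_rate(graded_responses: list[dict]) -> float:
    """Proportion of all extracted claims that are supported by the retrieved context."""
    supported = sum(r["supported_claims"] for r in graded_responses)
    total = sum(r["total_claims"] for r in graded_responses)
    return supported / total if total else 0.0

# Example with three graded responses:
graded = [
    {"hallucinated_claims": 0, "supported_claims": 4, "total_claims": 4},
    {"hallucinated_claims": 1, "supported_claims": 2, "total_claims": 3},
    {"hallucinated_claims": 0, "supported_claims": 5, "total_claims": 5},
]
print(hallucination_rate(graded))  # 0.33 - one of three responses hallucinated
print(grounding_rate(graded))      # 0.92 - 11 of 12 claims grounded in context
```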
What Is an Acceptable Hallucination Rate?
| Use case | Acceptable hallucination rate |
|---|---|
| Medical diagnosis support | <0.5% |
| Legal research | <1% |
| Financial advice | <2% |
| General customer support | <5% |
Enterprise B2B procurement teams increasingly specify a maximum hallucination rate. 5% is a common threshold; regulated industries may require below 2%.
How to Reduce Hallucination Rate
The most effective interventions, ranked by impact:
- RAG over verified facts. Retrieve ground truth from a verified knowledge base rather than relying on the model’s parametric knowledge.
- Constrained outputs and explicit uncertainty. Prompt the model to answer only from the provided context and to state uncertainty explicitly rather than generating a plausible-sounding answer.
- Temperature reduction. Lower temperature reduces creative generation and hallucinatory behaviour.
- Fine-tuning on domain data. For narrow domains, fine-tuning on high-quality domain data significantly reduces out-of-distribution hallucination.
- Output validation. Post-process model outputs to verify factual claims against a knowledge base.
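As a rough illustration of output validation, the sketch below flags response sentences that share little vocabulary with the retrieved context. A production validator would typically use an NLI model or an LLM-as-judge rather than this token-overlap heuristic; the function name, threshold, and example strings are assumptions for illustration.

```python
# Simplified output validation: flag response sentences with low lexical overlap
# against the retrieved context. Illustrative heuristic only - not a substitute
# for NLI- or judge-based claim verification.
import re

def unsupported_sentences(response: str, context: str, min_overlap: float = 0.5) -> list[str]:
    """Return response sentences whose content words are mostly absent from the context."""
    context_words = set(re.findall(r"\w+", context.lower()))
    flagged = []
    for sentence in re.split(r"(?<=[.!?])\s+", response.strip()):
        words = [w for w in re.findall(r"\w+", sentence.lower()) if len(w) > 3]
        if not words:
            continue
        overlap = sum(w in context_words for w in words) / len(words)
        if overlap < min_overlap:
            flagged.append(sentence)
    return flagged

context = "The refund window is 30 days from the date of delivery."
response = ("You can request a refund within 30 days of delivery. "
            "Refunds are paid in store credit only.")
print(unsupported_sentences(response, context))
# -> ['Refunds are paid in store credit only.']
```

Flagged sentences can then be routed to a stronger verifier, removed, or replaced with an explicit uncertainty statement before the response reaches the user.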
Book a free AI QA scope call to benchmark your LLM’s hallucination rate and get a prioritised remediation plan.
Ship AI You Can Trust.
Book a free 30-minute AI QA scope call with our experts. We review your model, data pipeline, or AI product - and show you exactly what to test before you ship.
Talk to an Expert