February 15, 2026 · 3 min read · aiml.qa

AI Hallucination Rate: How to Measure and Reduce It

A practical guide to measuring LLM hallucination rate - what hallucination is, how to build an evaluation set, which metrics to use, and how to reduce hallucination in production.


Hallucination occurs when a language model generates content that is factually incorrect, ungrounded, or fabricated - and presents it with apparent confidence. It is the most visible failure mode of LLMs and the most common reason enterprise customers reject AI products.

What Counts as a Hallucination?

There are three distinct types of hallucination that require different measurement approaches:

Factual hallucination - The model states something factually incorrect. These are verifiable against ground truth.

Grounding hallucination (in RAG systems) - The model generates a claim that is not supported by the retrieved context it was given. The claim may even be true, but it draws on knowledge that wasn’t in the model’s input - a particularly dangerous failure mode for knowledge-grounded AI products.

Citation hallucination - The model invents plausible-sounding but non-existent sources. The Mata v. Avianca case, where fabricated case citations were submitted to federal court, is the canonical example.

How to Build a Hallucination Evaluation Set

A useful hallucination evaluation set has three properties:

  1. Domain-specific. Generic benchmarks tell you about a model’s general hallucination tendency, not its hallucination rate on your specific domain and use case.

  2. Ground-truth verified. Every question in the set has a verified, unambiguous correct answer - verified by a domain expert, not by the model being evaluated.

  3. Adversarial coverage. Include questions designed to elicit hallucination: questions near but outside the model’s training distribution, questions with plausible-sounding false premises, and questions requiring precise numerical or citation recall.

A practical evaluation set size is 100–200 questions.
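For concreteness, here is a sketch of what one item in such a set might look like, stored as JSONL. The field names and the example question are illustrative, not a standard schema:

```python
import json

# One hypothetical evaluation item; field names are illustrative only.
eval_item = {
    "id": "q-042",
    "category": "false_premise",          # or "numerical_recall", "citation_recall", ...
    "question": "What year did Acme Corp's MX-7 controller receive CE marking?",
    "ground_truth": "The MX-7 has never received CE marking.",  # rejects the false premise
    "verified_by": "domain_expert",       # signed off by a human expert, not a model
}

# One item per line in a JSONL file keeps the set easy to version, diff and sample.
with open("hallucination_eval_set.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(eval_item) + "\n")
```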

Measuring Hallucination Rate

The standard metric is the hallucination rate: the proportion of responses containing at least one hallucinated claim.

For RAG systems, also measure grounding rate: the proportion of claims in each response that are supported by the retrieved context.
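In code, both metrics reduce to simple counting once each response has been judged at the claim level (by a domain expert or a spot-checked LLM judge). A minimal sketch, assuming judgments are already available:

```python
from dataclasses import dataclass

@dataclass
class JudgedResponse:
    # One judged model response; claim-level labels come from expert review
    # or an LLM judge that has been validated against human annotators.
    claims_total: int         # atomic claims extracted from the response
    claims_hallucinated: int  # claims contradicting ground truth or fabricated
    claims_grounded: int      # claims supported by the retrieved context (RAG only)

def hallucination_rate(responses: list[JudgedResponse]) -> float:
    """Proportion of responses containing at least one hallucinated claim."""
    flagged = sum(1 for r in responses if r.claims_hallucinated > 0)
    return flagged / len(responses)

def grounding_rate(responses: list[JudgedResponse]) -> float:
    """Proportion of all claims that are supported by the retrieved context."""
    total = sum(r.claims_total for r in responses)
    grounded = sum(r.claims_grounded for r in responses)
    return grounded / total if total else 0.0
```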

What Is an Acceptable Hallucination Rate?

Use case                  | Acceptable hallucination rate
Medical diagnosis support | <0.5%
Legal research            | <1%
Financial advice          | <2%
General customer support  | <5%

Enterprise B2B procurement teams increasingly specify a maximum hallucination rate. 5% is a common threshold; regulated industries may require below 2%.
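Whether a given threshold is even demonstrable depends on evaluation set size. A rough sanity check using the Wilson score interval, with illustrative numbers for a 150-question set:

```python
import math

def wilson_interval(failures: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% Wilson score interval for an observed hallucination rate."""
    p = failures / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half

# With 3 hallucinated responses out of 150 (2% observed), the interval is
# roughly 0.7% to 5.7% - too wide to demonstrate compliance with a <2% requirement.
print(wilson_interval(3, 150))
```

The practical implication: the tighter the contractual threshold, the larger the evaluation set needed to show you actually meet it.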

How to Reduce Hallucination Rate

The most effective interventions, ranked by impact:

  1. RAG over facts. Retrieve ground truth from a verified knowledge base rather than relying on parametric model knowledge.
  2. Constrained output formats. Prompt the model to state uncertainty explicitly rather than generating a plausible-sounding answer.
  3. Temperature reduction. Lower temperature reduces creative generation and hallucinatory behaviour.
  4. Fine-tuning on domain data. For narrow domains, fine-tuning on high-quality domain data significantly reduces out-of-distribution hallucination.
  5. Output validation. Post-process model outputs to verify factual claims against a knowledge base.
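A minimal sketch of how interventions 1-3 and 5 can be combined in a single answer path. The helpers `retrieve`, `call_llm`, and `claims_supported_by` are stubs standing in for your own retrieval layer, model client, and claim checker - swap in real implementations before use:

```python
def retrieve(question: str, top_k: int = 5) -> str:
    """Stub: return the top_k passages from a verified knowledge base."""
    return ""

def call_llm(system: str, user: str, temperature: float = 0.1) -> str:
    """Stub: call your model provider with the given prompts and temperature."""
    return ""

def claims_supported_by(answer: str, context: str) -> bool:
    """Stub: entailment check (NLI model or LLM judge) that every claim is grounded."""
    return True

SYSTEM_PROMPT = (
    "Answer using only the provided context. "
    "If the context does not contain the answer, reply exactly: "
    "'I don't know based on the provided documents.'"
)

def answer_with_guardrails(question: str) -> str:
    context = retrieve(question, top_k=5)      # 1. RAG over a verified knowledge base
    draft = call_llm(
        system=SYSTEM_PROMPT,                  # 2. constrained output with explicit abstention
        user=f"Context:\n{context}\n\nQuestion: {question}",
        temperature=0.1,                       # 3. low temperature curbs speculative generation
    )
    # 5. Output validation: refuse rather than return an ungrounded answer.
    if not claims_supported_by(draft, context):
        return "I don't know based on the provided documents."
    return draft
```

The key design choice is failing closed: when grounding cannot be verified, the system abstains instead of returning a fluent but unsupported answer.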

Book a free AI QA scope call to benchmark your LLM’s hallucination rate and get a prioritised remediation plan.

Ship AI You Can Trust.

Book a free 30-minute AI QA scope call with our experts. We review your model, data pipeline, or AI product - and show you exactly what to test before you ship.

Talk to an Expert