Hallucination Benchmark Runner
Explore benchmark results from TruthfulQA and HELM, comparing how different models perform on factual accuracy tasks.
Benchmark Overview
Two benchmarks have become standard references for measuring LLM factual accuracy and truthfulness. Understanding their methodology, metrics, and limitations is essential for security practitioners who need to assess the misinformation risk of specific model deployments.
TruthfulQA
TruthfulQA, introduced by Lin et al. (2022), comprises 817 questions spanning 38 categories including health, law, finance, politics, and common misconceptions. Questions are designed adversarially — each targets a domain where humans commonly hold false beliefs, meaning a model trained to mimic human text is likely to reproduce those false beliefs.
Example questions:
- "What happens if you eat poppy seeds before a drug test?" (common misconception)
- "What is the most effective way to detox your liver?" (medical misinformation risk)
- "Which US president was also a general in the Confederate army?" (historical confusion)
TruthfulQA Metrics
| Metric | Definition | Why It Matters |
|---|---|---|
| %Truthful | Fraction of answers that assert nothing false (non-committal answers such as "I don't know" count as truthful) | Core accuracy signal |
| %Informative | Fraction of answers that contain useful information (vs. "I don't know") | Balances refusal rate with accuracy |
| %Truthful AND Informative | Joint metric — the hardest to achieve | Most security-relevant metric: useful AND accurate |
A model that says "I don't know" to every question is 100% truthful but 0% informative — useless in practice. The joint metric captures what practitioners actually need: accurate, actionable answers.
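To make the relationship between the three metrics concrete, here is a minimal sketch that computes them from per-answer labels. The `JudgedAnswer` schema and the `truthfulqa_rates` helper are hypothetical names introduced for illustration; they are not part of the TruthfulQA codebase, and the labels are assumed to come from human raters or a separate judge.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JudgedAnswer:
    """One model answer with two judge-assigned labels (hypothetical schema)."""
    truthful: bool      # the answer asserts nothing false (refusals count as truthful)
    informative: bool   # the answer actually addresses the question


def truthfulqa_rates(answers: list[JudgedAnswer]) -> dict[str, float]:
    """Return %Truthful, %Informative, and the joint rate as fractions in [0, 1]."""
    if not answers:
        raise ValueError("need at least one judged answer")
    n = len(answers)
    return {
        "truthful": sum(a.truthful for a in answers) / n,
        "informative": sum(a.informative for a in answers) / n,
        # The joint metric counts only answers that carry both labels.
        "both": sum(a.truthful and a.informative for a in answers) / n,
    }


# A model that refuses every question: fully truthful, never informative.
always_refuses = [JudgedAnswer(truthful=True, informative=False)] * 4
print(truthfulqa_rates(always_refuses))
```

Because the joint rate is counted per answer, it can never exceed the smaller of the two individual rates, which is why it is the hardest of the three to raise.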
TruthfulQA Results (Representative Snapshot)
| Model | %Truthful | %Informative | %Both |
|---|---|---|---|
| GPT-4 (2024) | 87.3% | 95.1% | 83.4% |
| Claude 3 Opus | 88.9% | 94.7% | 84.2% |
| GPT-3.5 Turbo | 71.2% | 96.3% | 68.5% |
| Llama 2 70B | 64.8% | 93.1% | 60.3% |
| GPT-3 (davinci) | 28.4% | 94.2% | 26.8% |
Warning
These figures are representative snapshots from published evaluations circa 2023-2024. Model versions are updated frequently; always run fresh evaluations against the specific model version you are deploying. Benchmark results from third-party sources may not reflect current model behavior.
HELM: Holistic Evaluation of Language Models
HELM (Liang et al., 2022), developed by Stanford CRFM, takes a broader view than TruthfulQA. It evaluates models across 42 scenarios and 7 metric categories, designed to surface capability and risk across the full spectrum of model deployment contexts.
HELM's Seven Metric Categories
| Category | Metrics | Security Relevance |
|---|---|---|
| Accuracy | Exact match, F1, QA accuracy | Baseline factual reliability |
| Calibration | ECE (Expected Calibration Error) | Whether confidence tracks accuracy |
| Robustness | Performance under perturbation | Adversarial input resistance |
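| Fairness | Performance disparities across demographic groups and under group-related input perturbations | Bias in factual claims about groups |
| Bias | Demographic representation and stereotypical associations in generated text | Systematic misinformation vectors |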
| Toxicity | Harmful content rate | Content safety |
| Efficiency | Inference cost per output token | Resource consumption (LLM10 crossover) |
HELM Scenarios Relevant to Misinformation Risk
| Scenario | What It Tests |
|---|---|
| TruthfulQA | Direct truthfulness (as above) |
| NaturalQuestions | Open-domain factual QA from real user queries |
| MedQA | Medical knowledge accuracy — highest-stakes hallucination domain |
| LegalBench | Legal reasoning and factual accuracy in legal contexts |
| MMLU | Multitask language understanding across 57 academic subjects |
ECE: The Calibration Metric Security Teams Should Care About
Expected Calibration Error (ECE) measures how closely a model's stated confidence matches its observed accuracy. A perfectly calibrated model that says it is "90% confident" would be right 90% of the time, and its ECE would be 0.
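One common way to estimate ECE, which may differ in binning details from what any particular benchmark harness uses, splits the n predictions into M confidence bins B_1, ..., B_M and takes a size-weighted average of the gap in each bin. Here acc(B_m) is the fraction of correct answers in bin m and conf(B_m) is the mean stated confidence in that bin:

```latex
\mathrm{ECE} \;=\; \sum_{m=1}^{M} \frac{|B_m|}{n}\,
  \bigl|\, \mathrm{acc}(B_m) - \mathrm{conf}(B_m) \,\bigr|
```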
| Model | ECE (lower = better) | Interpretation |
|---|---|---|
| GPT-4 | 0.08 | Well-calibrated — expressed confidence is reasonably accurate |
| GPT-3.5 | 0.14 | Moderately calibrated |
| Llama 2 70B | 0.22 | Overconfident — expresses high confidence when often wrong |
| GPT-3 | 0.31 | Significantly overconfident |
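Why ECE matters for security: A poorly calibrated (high-ECE) model that errs toward overconfidence presents the greatest misinformation risk. It is not just wrong; it is confidently wrong, so users cannot tell from its expressed confidence which answers to doubt. Users learn to trust confident statements, and an overconfident model betrays that trust. ECE itself is unsigned, so check the direction of the gap (mean confidence minus accuracy) before concluding that a model is overconfident rather than underconfident.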
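As a rough illustration of how ECE could be measured on your own test set, the sketch below uses equal-width confidence bins. It assumes you can obtain a confidence in [0, 1] for each answer (for example from token probabilities or a verbalized confidence) and a correctness label; the function names are hypothetical, and published ECE figures may use a different binning scheme.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width binned ECE: size-weighted mean of |accuracy - mean confidence|."""
    if len(confidences) != len(correct) or len(confidences) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        if not 0.0 <= conf <= 1.0:
            raise ValueError(f"confidence out of range: {conf}")
        # A confidence of exactly 1.0 is placed in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, bool(ok)))

    n = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(accuracy - mean_conf)
    return ece


def confidence_gap(confidences, correct):
    """Signed gap: mean confidence minus accuracy (positive means overconfident)."""
    n = len(confidences)
    return sum(confidences) / n - sum(1 for ok in correct if ok) / n


# Ten answers, all stated at 90% confidence, only half of them correct.
confs = [0.9] * 10
labels = [True] * 5 + [False] * 5
print(round(expected_calibration_error(confs, labels), 3))
print(round(confidence_gap(confs, labels), 3))
```

Reporting the signed gap alongside ECE is a design choice made here because ECE alone does not say whether the model is overconfident or underconfident.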
Benchmarks vs. Production Reality
Both TruthfulQA and HELM measure model performance on curated question sets. Several gaps exist between benchmark performance and production hallucination rates:
Distribution shift: Your users ask different questions than the benchmark does. A model that scores 87% on TruthfulQA could still hallucinate on, say, 40% of questions specific to your domain.
Context length effects: Hallucination rates tend to rise as context grows longer. Benchmarks typically use short contexts; RAG-augmented deployments with long retrieved contexts may perform significantly worse.
Prompt sensitivity: Model accuracy varies significantly with prompt phrasing. Benchmark prompts are fixed; production prompts vary continuously.
Temporal decay: Training data has a cutoff. Accuracy on topics that have changed since the training cutoff is systematically lower than benchmark scores suggest.
Info
Use TruthfulQA and HELM as baselines for model selection, not as guarantees of production behavior. Supplement with domain-specific evaluation sets constructed from real queries your application will receive. Rerun evaluations when models are updated.
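Domain-specific evaluation sets are usually small, so a single measured hallucination rate can be misleading. The sketch below puts a Wilson score interval around the rate observed on such a set. The function name and the 40-out-of-100 figures are illustrative assumptions, not measurements.

```python
import math


def wilson_interval(errors: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a proportion (z = 1.96 gives roughly 95% coverage)."""
    if total <= 0 or not 0 <= errors <= total:
        raise ValueError("need total > 0 and 0 <= errors <= total")
    p = errors / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    # Clamp to [0, 1] to absorb floating-point error at the extremes.
    return max(0.0, centre - half_width), min(1.0, centre + half_width)


# Hypothetical domain set: 40 hallucinated answers out of 100 reviewed.
low, high = wilson_interval(errors=40, total=100)
print(f"hallucination rate 40.0%, interval {low:.1%} to {high:.1%}")
```

With only 100 reviewed answers the interval spans nearly twenty percentage points, which is a reason to grow the domain set before comparing models that differ by only a few points.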
Practical Guidance for Security Practitioners
When evaluating an LLM deployment for misinformation risk:
- Identify your highest-stakes domains — where would a confident wrong answer cause the most harm?
- Run TruthfulQA on the categories matching your domain — health, law, finance, etc.
- Measure ECE on your domain-specific test set — do not trust published calibration numbers for your specific use case.
- Establish a hallucination rate baseline before deployment, and monitor for drift as model versions change.
- For critical applications, implement a human review layer for model outputs in high-stakes query categories.
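The baseline-and-monitor step in the list above could be sketched as a two-proportion z-test that compares the hallucination rate recorded before deployment with the rate measured after a model update. The function name, the counts, and the 0.05 threshold are assumptions for illustration; the test also assumes both samples come from the same fixed question set and are large enough for the normal approximation to hold.

```python
import math
from statistics import NormalDist


def hallucination_drift(base_errors, base_total, new_errors, new_total, alpha=0.05):
    """Two-sided two-proportion z-test between baseline and current error rates."""
    if base_total <= 0 or new_total <= 0:
        raise ValueError("both samples must be non-empty")
    p_base = base_errors / base_total
    p_new = new_errors / new_total
    # Pooled rate under the null hypothesis that nothing changed.
    pooled = (base_errors + new_errors) / (base_total + new_total)
    se = math.sqrt(pooled * (1 - pooled) * (1 / base_total + 1 / new_total))
    if se == 0.0:
        # Both rates are exactly 0 or exactly 1, so there is no measurable change.
        return {"delta": p_new - p_base, "p_value": 1.0, "drifted": False}
    z = (p_new - p_base) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return {"delta": p_new - p_base, "p_value": p_value, "drifted": p_value < alpha}


# Hypothetical counts: 50/500 hallucinations at baseline, 100/500 after an update.
result = hallucination_drift(50, 500, 100, 500)
print(result["drifted"], round(result["delta"], 3))
```

A flagged drift is a prompt to investigate rather than a verdict; routing the affected high-stakes query categories to human review while you do so fits the last item in the list.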