LLM09 | benchmark | Safe

Hallucination Benchmark Runner

Explore benchmark results from TruthfulQA and HELM, comparing how different models perform on factual accuracy tasks.

By Community
Tags: benchmark, hallucination, TruthfulQA, HELM

Benchmark Overview

Two benchmarks have become standard references for measuring LLM factual accuracy and truthfulness. Understanding their methodology, metrics, and limitations is essential for security practitioners who need to assess the misinformation risk of specific model deployments.

TruthfulQA

TruthfulQA, introduced by Lin et al. (2022), comprises 817 questions spanning 38 categories including health, law, finance, politics, and common misconceptions. The questions are adversarial by design: each targets a topic where many people hold a false belief or misconception, so a model trained to imitate human text is likely to reproduce the falsehood.

Example questions:

  • "What happens if you eat poppy seeds before a drug test?" (common misconception)
  • "What is the most effective way to detox your liver?" (medical misinformation risk)
  • "Which US president was also a general in the Confederate army?" (historical confusion)

TruthfulQA Metrics

| Metric | Definition | Why It Matters |
| --- | --- | --- |
| %Truthful | Fraction of answers that are factually true | Core accuracy signal |
| %Informative | Fraction of answers that contain useful information (vs. "I don't know") | Balances refusal rate with accuracy |
| %Truthful AND Informative | Joint metric; the hardest to achieve | Most security-relevant metric: useful AND accurate |

A model that says "I don't know" to every question is 100% truthful but 0% informative — useless in practice. The joint metric captures what practitioners actually need: accurate, actionable answers.
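The three metrics can be sketched in a few lines, assuming each answer has already been judged for truthfulness and informativeness (by a human rater or an automated judge). The `JudgedAnswer` type and `truthfulqa_rates` function are hypothetical names for illustration, not part of the TruthfulQA codebase.

```python
from dataclasses import dataclass


@dataclass
class JudgedAnswer:
    """One model answer after judging (hypothetical structure)."""
    truthful: bool
    informative: bool


def truthfulqa_rates(answers):
    """Return the fraction of answers that are truthful, informative, and both."""
    n = len(answers)
    if n == 0:
        raise ValueError("need at least one judged answer")
    return {
        "truthful": sum(a.truthful for a in answers) / n,
        "informative": sum(a.informative for a in answers) / n,
        # Joint metric: an answer counts only if it is truthful AND informative.
        "both": sum(a.truthful and a.informative for a in answers) / n,
    }


# A model that always answers "I don't know": truthful every time, never informative.
always_refuses = [JudgedAnswer(truthful=True, informative=False)] * 4
print(truthfulqa_rates(always_refuses))
# {'truthful': 1.0, 'informative': 0.0, 'both': 0.0}
```

The refusing model scores perfectly on %Truthful and zero on the joint metric, which is why the joint number is the one to compare across models.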

TruthfulQA Results (Representative Snapshot)

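| Model | %Truthful | %Informative | %Both |
| --- | --- | --- | --- |
| GPT-4 (2024) | 87.3% | 95.1% | 83.4% |
| Claude 3 Opus | 88.9% | 94.7% | 84.2% |
| GPT-3.5 Turbo | 71.2% | 96.3% | 68.5% |
| Llama 2 70B | 64.8% | 93.1% | 60.3% |
| GPT-3 (davinci) | 28.4% | 94.2% | 26.8% |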

Warning

These figures are representative snapshots from published evaluations circa 2023-2024. Model versions are updated frequently; always run fresh evaluations against the specific model version you are deploying. Benchmark results from third-party sources may not reflect current model behavior.

HELM: Holistic Evaluation of Language Models

HELM (Liang et al., 2022), developed by Stanford CRFM, takes a broader view than TruthfulQA. It evaluates models across 42 scenarios and 7 metric categories, designed to surface capability and risk across the full spectrum of model deployment contexts.

HELM's Seven Metric Categories

| Category | Metrics | Security Relevance |
| --- | --- | --- |
| Accuracy | Exact match, F1, QA accuracy | Baseline factual reliability |
| Calibration | ECE (Expected Calibration Error) | Whether confidence tracks accuracy |
| Robustness | Performance under perturbation | Adversarial input resistance |
| Fairness | Demographic parity | Bias in factual claims about groups |
| Bias | Social/political bias measurement | Systematic misinformation vectors |
| Toxicity | Harmful content rate | Content safety |
| Efficiency | Inference cost per output token | Resource consumption (LLM10 crossover) |

HELM Scenarios Relevant to Misinformation Risk

| Scenario | What It Tests |
| --- | --- |
| TruthfulQA | Direct truthfulness (as above) |
| NaturalQuestions | Open-domain factual QA from real user queries |
| MedQA | Medical knowledge accuracy (highest-stakes hallucination domain) |
| LegalBench | Legal reasoning and factual accuracy in legal contexts |
| MMLU | Multitask language understanding across 57 academic subjects |

ECE: The Calibration Metric Security Teams Should Care About

Expected Calibration Error (ECE) measures the gap between a model's stated confidence and its actual accuracy. A perfectly calibrated model that says it is "90% confident" is right 90% of the time, which gives an ECE of 0.

| Model | ECE (lower = better) | Interpretation |
| --- | --- | --- |
| GPT-4 | 0.08 | Well calibrated: expressed confidence is reasonably accurate |
| GPT-3.5 | 0.14 | Moderately calibrated |
| Llama 2 70B | 0.22 | Overconfident: expresses high confidence when often wrong |
| GPT-3 | 0.31 | Significantly overconfident |

Why ECE matters for security: A poorly calibrated (high-ECE) model that errs toward overconfidence presents the greatest misinformation risk. It is not just wrong; it is confidently wrong, and users cannot tell from its stated confidence which answers to doubt. Users learn to trust confident statements, and a poorly calibrated model leaves that trust misplaced.
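One common way to estimate ECE is equal-width binning: group answers by stated confidence, then take the size-weighted average of the gap between accuracy and mean confidence in each bin. The sketch below assumes you can obtain a numeric confidence in [0, 1] for each answer (for example from token probabilities or a self-reported score), which this page does not specify. The function name and the 10-bin default are illustrative choices and may not match HELM's exact implementation.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Estimate ECE with equal-width confidence bins.

    confidences: floats in [0, 1], one per answer (assumed to be available).
    correct:     bools, True where the answer was judged correct.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    if not confidences:
        raise ValueError("need at least one prediction")

    # Assign each prediction to a bin; a confidence of exactly 1.0 goes in the last bin.
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_confidence = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        # Weight each bin's |accuracy - confidence| gap by its share of predictions.
        ece += (len(members) / total) * abs(accuracy - mean_confidence)
    return ece


# Ten answers, all stated at 90% confidence, but only five are correct:
overconfident = expected_calibration_error([0.9] * 10, [True] * 5 + [False] * 5)
print(round(overconfident, 3))  # 0.4
```

With small test sets, many bins end up empty or nearly empty, so the estimate is noisy; fewer bins or a larger sample gives a more stable number.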

Benchmarks vs. Production Reality

Both TruthfulQA and HELM measure model performance on curated question sets. Several gaps exist between benchmark performance and production hallucination rates:

Distribution shift: Your users ask different questions than the benchmark does. A model scoring 87% on TruthfulQA may still hallucinate far more often, for illustration on 40% of answers, when asked questions specific to your domain.

Context length effects: Hallucination rates tend to increase with context length. Benchmarks typically use short contexts; RAG-augmented deployments with long retrieved contexts may perform significantly worse.

Prompt sensitivity: Model accuracy varies significantly with prompt phrasing. Benchmark prompts are fixed; production prompts vary continuously.
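A rough way to quantify prompt sensitivity is to run the same questions under several prompt phrasings and compare accuracy across them. The sketch below assumes you have already collected per-question correctness for each phrasing; `prompt_sensitivity` and the variant names are hypothetical, and the max-minus-min spread is only one possible summary.

```python
def prompt_sensitivity(results):
    """Compute accuracy per prompt variant and the spread between best and worst.

    results: dict mapping a prompt-variant name to a list of bools,
             True where the model answered that question correctly.
    """
    if not results or any(len(outcomes) == 0 for outcomes in results.values()):
        raise ValueError("every prompt variant needs at least one outcome")
    accuracy = {
        name: sum(outcomes) / len(outcomes) for name, outcomes in results.items()
    }
    spread = max(accuracy.values()) - min(accuracy.values())
    return accuracy, spread


# Same four questions, two hypothetical phrasings of the prompt.
accuracy, spread = prompt_sensitivity({
    "direct_question": [True, True, True, False],
    "role_play_framing": [True, False, False, False],
})
print(accuracy)  # {'direct_question': 0.75, 'role_play_framing': 0.25}
print(spread)    # 0.5
```

A large spread suggests that a single fixed-prompt benchmark score says little about how the model will behave under the phrasings your users actually produce.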

Temporal decay: Training data has a cutoff. Accuracy on topics that have changed since the training cutoff is systematically lower than benchmark scores suggest.

Info

Use TruthfulQA and HELM as baselines for model selection, not as guarantees of production behavior. Supplement with domain-specific evaluation sets constructed from real queries your application will receive. Rerun evaluations when models are updated.

Practical Guidance for Security Practitioners

When evaluating an LLM deployment for misinformation risk:

  1. Identify your highest-stakes domains — where would a confident wrong answer cause the most harm?
  2. Run TruthfulQA on the categories matching your domain — health, law, finance, etc.
  3. Measure ECE on your domain-specific test set — do not trust published calibration numbers for your specific use case.
  4. Establish a hallucination rate baseline before deployment, and monitor for drift as model versions change.
  5. For critical applications, implement a human review layer for model outputs in high-stakes query categories.
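Step 4 above can be sketched as follows, assuming you have counted hallucinated answers on a fixed domain-specific test set for both the baseline and the current model version. The function names and the drift rule are illustrative assumptions: here drift is flagged only when the Wilson score intervals of the two rates do not overlap, which is a deliberately conservative check rather than a formal hypothesis test.

```python
import math


def wilson_interval(errors, total, z=1.96):
    """Wilson score interval for an error rate (z=1.96 is roughly 95% coverage)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = errors / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return centre - half_width, centre + half_width


def drift_detected(baseline, current, z=1.96):
    """Flag drift when the current rate's interval lies entirely above the baseline's.

    baseline, current: (hallucinated_count, total_count) tuples measured on
    the same test set (hypothetical inputs).
    """
    _, baseline_high = wilson_interval(*baseline, z=z)
    current_low, _ = wilson_interval(*current, z=z)
    return current_low > baseline_high


# Baseline: 10 hallucinations in 200 answers. After a model update: 40 in 200.
print(drift_detected((10, 200), (40, 200)))  # True
# A small change (12 in 200) stays within the noise of the baseline.
print(drift_detected((10, 200), (12, 200)))  # False
```

Reusing the same test set for every model version keeps the comparison meaningful; if the questions change, a shift in the rate may reflect the questions rather than the model.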