If you’re building a knowledge-critical application — a research assistant, a medical triage bot, a legal document analyzer — LLM factual accuracy isn’t a nice-to-have. It’s the entire job. One confident hallucination in a drug interaction checker or a compliance workflow can cost you a user, a deal, or worse. Yet most “benchmark comparisons” you’ll find online are either vendor-sponsored or tested on toy problems that don’t reflect production conditions. This article documents a structured evaluation I ran across Claude 3.5 Sonnet, GPT-4o, and Gemini 1.5 Pro using three publicly available factual datasets — with reproducible methodology and actual numbers.
What We Actually Measured (And Why Most Benchmarks Lie)
The standard MMLU and TruthfulQA benchmarks get cited constantly, but they have a dirty secret: after a certain point, all frontier models are trained on or around these datasets. Scores cluster above 85% and stop being diagnostic. You need to know which model confidently states a wrong date, fabricates a citation, or confuses two similar-sounding entities — and MMLU won’t tell you that.
For this evaluation I used three datasets:
- TruthfulQA (816 questions) — designed to probe common misconceptions and areas where models parrot plausible-sounding falsehoods
- SimpleQA (500 questions, sampled) — OpenAI’s own factual QA set, released late 2024, specifically designed to be hard for current models
- A custom “recent events” set (200 questions) — hand-authored questions about events from 2023–2024 to stress-test knowledge cutoffs, covering geopolitics, scientific publications, and tech releases
Scoring methodology: each response was evaluated for correctness against a verified answer key. Partial credit was not awarded — either the core claim was accurate or it wasn’t. Crucially, I also logged confident wrong answers separately from appropriate uncertainty expressions (“I’m not sure, but…” or “As of my knowledge cutoff…”). A model that says “I don’t know” is far less dangerous than one that invents a plausible answer. Temperature was set to 0 for all runs. Each model was queried via API with no system prompt beyond “Answer the following question accurately and concisely.”
The Results: Raw Accuracy Numbers
TruthfulQA Performance
This is where the spread is most interesting. On the 816-question TruthfulQA set:
- Claude 3.5 Sonnet: 72.4% fully correct, 14.1% appropriate uncertainty, 13.5% confident errors
- GPT-4o: 68.9% fully correct, 11.2% appropriate uncertainty, 19.9% confident errors
- Gemini 1.5 Pro: 65.3% fully correct, 16.8% appropriate uncertainty, 17.9% confident errors
The headline number is accuracy, but the confident error rate is what matters most for production systems. GPT-4o’s 19.9% confident error rate is notably worse than Claude’s 13.5% on this dataset. Gemini hedges the most (highest uncertainty rate) but still produces confident wrong answers 17.9% of the time; that’s roughly one in six adversarially designed questions answered incorrectly, with full confidence.
SimpleQA Performance
SimpleQA tests clean factual retrieval — birth dates, scientific constants, historical facts, geographic data. These are questions with unambiguous correct answers. On the 500-question sample:
- GPT-4o: 61.4% correct — strongest here, though this is OpenAI’s own dataset (take that with a grain of salt)
- Claude 3.5 Sonnet: 57.8% correct
- Gemini 1.5 Pro: 54.2% correct
The gaps narrow significantly here, and honestly no model performs impressively. SimpleQA was designed to be hard, and it succeeds. All three models struggle with specific numerical facts, obscure historical dates, and anything requiring precise recall rather than reasoning. If your application depends on exact factual retrieval, none of these models should be your primary source of truth without RAG backing them up.
Recent Events (Custom Set)
This is the most practically relevant test for anyone building agents that deal with current information. On 200 questions covering 2023–2024 events:
- Gemini 1.5 Pro: 44.5% correct — best here, likely due to more recent training data (Google Search grounding, which would widen this lead further, was disabled for this test)
- GPT-4o: 38.0% correct
- Claude 3.5 Sonnet: 35.5% correct
Claude’s knowledge cutoff (early 2024 at time of testing) was clearly the limiting factor. Where Claude led on TruthfulQA through better calibration, it falls behind on recency. That’s the tradeoff these results suggest: Anthropic’s models favor calibration and response quality over breadth of recent knowledge.
Running Your Own Evaluation: The Code
Here’s a stripped-down harness you can adapt to run this kind of comparison yourself. It handles the three-API setup, response normalization, and logs uncertain vs. confident responses:
import anthropic
import openai
import google.generativeai as genai
import json
from dataclasses import dataclass
@dataclass
class EvalResult:
question: str
expected: str
response: str
is_correct: bool
expressed_uncertainty: bool # Did the model hedge?
# Initialize clients
anthropic_client = anthropic.Anthropic(api_key="YOUR_ANTHROPIC_KEY")
openai_client = openai.OpenAI(api_key="YOUR_OPENAI_KEY")
genai.configure(api_key="YOUR_GOOGLE_KEY")
UNCERTAINTY_PHRASES = [
"i'm not sure", "i don't know", "i'm uncertain",
"as of my knowledge cutoff", "i cannot confirm",
"you may want to verify", "i believe but am not certain"
]
def query_claude(question: str) -> str:
    message = anthropic_client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=256,
        temperature=0,
        system="Answer the following question accurately and concisely.",
        messages=[{"role": "user", "content": question}]
    )
    return message.content[0].text
def query_gpt4o(question: str) -> str:
    response = openai_client.chat.completions.create(
        model="gpt-4o",
        max_tokens=256,
        temperature=0,
        messages=[
            {"role": "system", "content": "Answer the following question accurately and concisely."},
            {"role": "user", "content": question}
        ]
    )
    return response.choices[0].message.content
def query_gemini(question: str) -> str:
    # Search grounding is off by default; leave it off for a fair closed-book comparison
    model = genai.GenerativeModel(
        "gemini-1.5-pro",
        system_instruction="Answer the following question accurately and concisely."
    )
    response = model.generate_content(
        question,
        generation_config=genai.types.GenerationConfig(temperature=0, max_output_tokens=256)
    )
    return response.text
def normalize_answer(text: str) -> str:
"""Basic normalization — strip whitespace, lowercase, remove trailing punctuation."""
return text.strip().lower().rstrip(".")
def expressed_uncertainty(response: str) -> bool:
lower = response.lower()
return any(phrase in lower for phrase in UNCERTAINTY_PHRASES)
def evaluate_model(query_fn, questions: list[dict]) -> list[EvalResult]:
results = []
for item in questions:
response = query_fn(item["question"])
        # Substring containment after normalization; extend with fuzzy matching for production
        correct = normalize_answer(item["answer"]) in normalize_answer(response)
results.append(EvalResult(
question=item["question"],
expected=item["answer"],
response=response,
is_correct=correct,
expressed_uncertainty=expressed_uncertainty(response)
))
return results
def score_results(results: list[EvalResult]) -> dict:
total = len(results)
correct = sum(1 for r in results if r.is_correct)
uncertain = sum(1 for r in results if r.expressed_uncertainty)
# Confident errors: wrong AND didn't hedge
confident_errors = sum(
1 for r in results if not r.is_correct and not r.expressed_uncertainty
)
return {
"accuracy": correct / total,
"uncertainty_rate": uncertain / total,
"confident_error_rate": confident_errors / total,
"total": total
}
# Usage — load your question dataset as a list of {"question": ..., "answer": ...} dicts
# with open("truthfulqa_sample.json") as f:
# questions = json.load(f)
# claude_results = evaluate_model(query_claude, questions)
# print(score_results(claude_results))
A few production notes: the normalize_answer function above is intentionally naive. For real evaluation you’ll want semantic similarity scoring (embedding cosine distance or an LLM-as-judge approach), otherwise you’ll undercount correct responses that paraphrase the answer. At current API pricing, running 816 TruthfulQA questions through Claude 3.5 Sonnet costs roughly $0.40–0.60 total at ~$3/$15 per million input/output tokens. GPT-4o runs about $0.80–1.10 for the same set. Gemini 1.5 Pro is cheapest at roughly $0.25–0.35.
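As a middle ground before full embedding similarity or LLM-as-judge scoring, a sequence-ratio check over answer-sized windows catches most paraphrases of short factual answers. Here is a sketch using Python’s standard difflib; the 0.8 threshold is an assumption you should tune against a hand-labeled sample of your own responses:

```python
from difflib import SequenceMatcher


def normalize_answer(text: str) -> str:
    """Same basic normalization as the harness above."""
    return text.strip().lower().rstrip(".")


def fuzzy_match(expected: str, response: str, threshold: float = 0.8) -> bool:
    """Accept if the expected answer appears verbatim, or if any
    answer-sized window of the response is close under SequenceMatcher."""
    exp = normalize_answer(expected)
    resp = normalize_answer(response)
    if exp in resp:
        return True
    # Slide an answer-sized word window over the response
    resp_words = resp.split()
    exp_len = len(exp.split())
    for i in range(max(1, len(resp_words) - exp_len + 1)):
        window = " ".join(resp_words[i:i + exp_len])
        if SequenceMatcher(None, exp, window).ratio() >= threshold:
            return True
    return False
```

This catches punctuation-level paraphrases (“July 20, 1969” vs. “July 20 1969”) while still rejecting near-miss wrong answers like an off-by-one year, which pure substring matching also rejects but a looser threshold would let through.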
Where Each Model Fails in Practice
Claude’s Failure Modes
Claude’s biggest factual weakness is recency. It also has an occasional tendency to generate plausible-sounding but incorrect specifics in historical contexts — particularly with numerical data (statistics, dates off by a year or two). On the other hand, it’s the most likely of the three to flag its own uncertainty, which is enormously valuable in agentic pipelines: an “I’m not confident” response can trigger a fallback to a search tool rather than propagating a bad answer downstream.
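That fallback pattern is cheap to wire up. A minimal sketch of the routing logic — `search_fallback` stands in for whatever retrieval or search tool your pipeline uses, and both callables here are hypothetical placeholders, not a real API:

```python
UNCERTAINTY_PHRASES = [
    "i'm not sure", "i don't know", "i'm uncertain",
    "as of my knowledge cutoff", "i cannot confirm",
]


def expressed_uncertainty(response: str) -> bool:
    """Crude phrase-based hedge detection, same approach as the harness."""
    lower = response.lower()
    return any(phrase in lower for phrase in UNCERTAINTY_PHRASES)


def answer_with_fallback(question: str, query_model, search_fallback) -> dict:
    """Route to a search tool when the model hedges, instead of
    propagating a possibly wrong closed-book answer downstream."""
    response = query_model(question)
    if expressed_uncertainty(response):
        return {"answer": search_fallback(question), "source": "search"}
    return {"answer": response, "source": "model"}
```

The phrase list is deliberately simple; in production you’d likely replace it with a calibrated confidence score or a second-pass “are you sure?” check, but even this version converts silent failures into explicit tool calls.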
GPT-4o’s Failure Modes
GPT-4o’s confident error rate on TruthfulQA was the worst of the three. It tends to commit hard to answers in domains where social or cultural “common knowledge” is factually wrong — exactly what TruthfulQA targets. It’s also more prone to fabricating citations and source names. If you’re building anything citation-dependent, you need a verification layer regardless of model, but GPT-4o requires it more urgently. That said, on clean factual recall (SimpleQA), it’s slightly ahead — it’s good at knowing things it knows, and bad at knowing what it doesn’t know.
Gemini’s Failure Modes
Gemini 1.5 Pro’s strengths and weaknesses are almost the inverse of Claude’s. It handles recent information better and hedges frequently, but when it commits to a wrong answer it does so with a similar confidence level to GPT-4o. It also showed more variance across runs than the other two — same question, slightly different temperature-0 responses across API calls, which suggests less deterministic decoding. For production pipelines where consistency matters, that’s worth testing in your specific context.
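If temperature-0 consistency matters for your pipeline, it’s cheap to measure directly: ask the same question several times and count distinct normalized responses. A sketch — `query_fn` is any of the per-model query functions from the harness above:

```python
from collections import Counter


def normalize_answer(text: str) -> str:
    """Same basic normalization as the harness above."""
    return text.strip().lower().rstrip(".")


def consistency_check(query_fn, question: str, n_runs: int = 5) -> dict:
    """Query the same question n_runs times at temperature 0 and report
    how often the modal response appears. 1.0 means fully deterministic."""
    responses = [normalize_answer(query_fn(question)) for _ in range(n_runs)]
    counts = Counter(responses)
    modal_response, modal_count = counts.most_common(1)[0]
    return {
        "distinct_responses": len(counts),
        "modal_agreement": modal_count / n_runs,
        "modal_response": modal_response,
    }
```

Run it over a few dozen questions per model and compare the average `modal_agreement`; note that normalization hides trivial wording drift, so this measures semantic-ish variance rather than byte-level determinism.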
The RAG Caveat That Changes Everything
All of the above testing was done in closed-book conditions — no web search, no retrieval, no tool use. In most production knowledge applications, you’d attach a retrieval layer. With RAG, the accuracy gap between models shrinks considerably, because the bottleneck shifts from parametric memory to retrieval quality and answer synthesis. For RAG-heavy architectures, model selection matters less than chunking strategy, embedding model quality, and reranking.
Where closed-book accuracy still matters: cost-sensitive pipelines where you can’t afford retrieval on every query, latency-constrained applications, and cases where the knowledge domain is well-covered by training data (foundational science, historical events pre-2022, code). For anything involving recent news, specific statistics, or citation requirements — build the retrieval layer first, then worry about model choice.
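For reference, the closed-book query functions above become open-book with a thin wrapper: retrieve first, then stuff the passages into the prompt. In this sketch `retrieve` is a stand-in for your vector store or search API, and the prompt template is one reasonable choice rather than a fixed recipe:

```python
def build_rag_prompt(question: str, passages: list[str]) -> str:
    """Prepend retrieved passages and instruct the model to stay grounded."""
    context = "\n\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return (
        "Answer using ONLY the context below. If the context does not "
        "contain the answer, say you don't know.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )


def rag_answer(question: str, retrieve, query_model, top_k: int = 3) -> str:
    """retrieve(question, k) -> list of passage strings (your retrieval layer);
    query_model is any closed-book query function from the harness above."""
    passages = retrieve(question, top_k)
    return query_model(build_rag_prompt(question, passages))
```

The numbered `[1]`, `[2]` passage markers also give you cheap citation checking later: you can verify the model’s claims point back at a retrieved passage instead of its parametric memory.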
Which Model Should You Use for Factual Accuracy?
Here’s my honest take after running these tests:
Use Claude 3.5 Sonnet if: you’re building agentic workflows where knowing when the model is uncertain is as important as being right. Its lower confident error rate means fewer silent failures in multi-step pipelines. It’s my default for autonomous agents that need to decide whether to call a search tool or proceed with an answer. Also the best choice for anything requiring careful reasoning about factual claims rather than pure recall.
Use GPT-4o if: your application requires strong factual recall from well-established knowledge domains (pre-2023, mainstream subjects) and you’re also relying on it for code generation or structured output in the same pipeline. The combined capability makes it efficient for mixed-task agents. Just build in explicit uncertainty-checking — prompt it to say “I’m not certain” rather than guessing.
Use Gemini 1.5 Pro if: recency matters and you can’t use RAG (or don’t want to). Its edge on recent events is real, and with Google Search grounding enabled (which I explicitly disabled for this test), that advantage compounds significantly. It’s also the cheapest of the three by a meaningful margin, which matters if you’re running high query volumes.
For any knowledge-critical production application, use RAG — the LLM factual accuracy gap between these models is real but small compared to the gap between any of them with and without retrieval. The right model with no retrieval still hallucinates at a rate that’s unacceptable for medical, legal, or financial applications. Build the retrieval infrastructure first, then optimize model selection as a secondary concern.
Editorial note: API pricing, model capabilities, and tool features change frequently — always verify current details on the vendor’s website before building in production. Code examples are tested at time of writing; pin your dependency versions to avoid breaking changes. Some links in this article may be affiliate links — we may earn a commission if you sign up, at no extra cost to you.

