RAG Benchmarks That Actually Matter: From BEIR to CRAG
The two flavors of RAG benchmarks
"RAG benchmark" is a phrase that hides two different questions. Did retrieval surface the right context? And, separately, was the final answer any good? Those questions need different datasets, different metrics, and different tooling, so mixing them up is how teams end up optimizing for a leaderboard number that has nothing to do with user experience.
Retrieval-only benchmarks judge the retriever against labeled query-document pairs. They are deterministic, cheap to run, and directly comparable across systems. They say nothing about whether the model wrote a good answer.
Answer-quality benchmarks judge the final generated response, which means you need either human raters or an LLM-as-judge. They cost more, have higher variance, and are far more sensitive to prompt wording, but they measure what the user actually sees. Serious RAG teams run both. If you do not have an internal labeled set yet, start with our guide on building a retrieval evaluation pipeline before reaching for public benchmarks.
Retrieval-only metrics worth knowing
These metrics are deterministic, reproducible, and what every retrieval-only benchmark below reports. Definitions follow; a short Python sketch after the list shows how the ranking-based ones are computed.
- Recall@K: fraction of relevant documents that appear in the top K results. The single most important metric for RAG, because if the right chunk is not in the context window, the LLM cannot use it.
- Precision@K: fraction of the top K results that are actually relevant. Matters when context window cost or noise is a concern.
- MRR (Mean Reciprocal Rank): 1 divided by the rank of the first relevant result, averaged across queries. Captures how quickly a user or LLM sees something useful.
- nDCG@K: position-discounted relevance score normalized against the ideal ranking. Supports graded relevance, not just binary.
- MAP (Mean Average Precision): average of precision values computed at each relevant document's rank, averaged over queries. Standard in classical IR.
- Hit Rate / Success@K: binary check of whether at least one relevant doc landed in the top K. Blunt but useful for sanity dashboards.
- R-Precision: precision computed at K equal to the number of known relevant docs for that query. Useful when the number of relevant docs varies widely.
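To make the ranking metrics concrete, here is a minimal, self-contained sketch of Recall@K, MRR, and nDCG@K for a single query. The document IDs and relevance grades are toy values; a real harness averages these numbers over every query in the eval set.

```python
import math

def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the known-relevant docs that appear in the top k results."""
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids) if relevant_ids else 0.0

def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / rank of the first relevant result; 0 if nothing relevant was retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(ranked_ids, grades, k):
    """nDCG@k with graded relevance passed as a {doc_id: grade} dict."""
    dcg = sum(grades.get(doc_id, 0) / math.log2(rank + 1)
              for rank, doc_id in enumerate(ranked_ids[:k], start=1))
    ideal = sorted(grades.values(), reverse=True)[:k]
    idcg = sum(grade / math.log2(rank + 1) for rank, grade in enumerate(ideal, start=1))
    return dcg / idcg if idcg > 0 else 0.0

# Toy example: one query where d2 and d7 are the relevant documents.
ranked = ["d5", "d2", "d9", "d7", "d1"]
relevant = {"d2", "d7"}
print(recall_at_k(ranked, relevant, 3))          # 0.5 -> only d2 made the top 3
print(reciprocal_rank(ranked, relevant))         # 0.5 -> first hit at rank 2
print(ndcg_at_k(ranked, {"d2": 1, "d7": 1}, 5))  # binary grades; ~0.65 here
```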
Answer-quality metrics (LLM-as-judge or human)
These measure the end-to-end system. Most are implemented as LLM-as-judge prompts today, with human raters as the gold standard for calibration. A minimal judge sketch appears at the end of this section.
- Faithfulness / Groundedness: does every claim in the answer trace back to a retrieved source. The primary hallucination signal.
- Answer Relevance: does the answer actually address the question that was asked.
- Context Precision: of the retrieved chunks, which ones were actually useful for producing the answer.
- Context Recall: did retrieval surface all the chunks needed to answer fully, measured against a reference answer.
- Answer Correctness: factual match against a reference answer. Can be exact match, F1, semantic similarity, or judge-scored.
- Answer Semantic Similarity: embedding-based similarity between generated and reference answers. Cheap proxy for correctness.
- Citation Accuracy: do inline citations point to passages that actually support the cited claim.
- Refusal Accuracy: does the system correctly abstain when the corpus lacks the answer. Central to CRAG and FRAMES.
- Hallucination Rate: fraction of answers containing unsupported claims. Often reported alongside correctness.
Classical NLG metrics like BLEU, ROUGE, METEOR, and BERTScore still show up in older benchmarks. They correlate poorly with human judgment for RAG and should not be the primary signal for a modern system.
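As a rough illustration of the judge pattern behind faithfulness-style metrics, here is a minimal sketch. Everything in it is illustrative: `call_llm` is a placeholder for whatever model client you use, and the prompt wording is not the exact prompt any framework ships with.

```python
# Minimal LLM-as-judge sketch for faithfulness. `call_llm` stands in for your
# model client; production judges need retries, output validation, and
# calibration against human labels.
JUDGE_PROMPT = """You are grading a RAG answer for faithfulness.

Context:
{context}

Answer:
{answer}

List each factual claim in the answer. Mark each claim SUPPORTED if the context
supports it and UNSUPPORTED otherwise. End with a line 'SCORE: <supported>/<total>'."""

def faithfulness_score(answer: str, contexts: list[str], call_llm) -> float:
    """Ask a judge model which claims are grounded; return supported / total."""
    prompt = JUDGE_PROMPT.format(context="\n---\n".join(contexts), answer=answer)
    verdict = call_llm(prompt)
    score_line = [line for line in verdict.splitlines() if line.startswith("SCORE:")][-1]
    supported, total = (int(x) for x in score_line.removeprefix("SCORE:").split("/"))
    return supported / total if total else 0.0
```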
Named benchmarks worth running
There are dozens of public RAG benchmarks. These are the ones cited most often in serious retrieval and generation papers as of early 2026.
BEIR
18 retrieval datasets across biomedical, finance, news, scientific, and argument domains. BEIR is the de facto zero-shot retrieval benchmark. When a new embedding or reranker model launches, it reports BEIR numbers.
Metrics: nDCG@10 (primary), Recall@100, MAP.
Links: paper, code, dataset.
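Running a zero-shot BEIR evaluation takes a few lines with the beir package. The sketch below follows the pattern in the repo's quickstart (SciFact is one of the smaller datasets); exact module paths and model names can differ across beir versions, so treat it as a starting point rather than a pinned recipe.

```python
# Zero-shot BEIR run on SciFact, following the beir repo's quickstart pattern.
from beir import util
from beir.datasets.data_loader import GenericDataLoader
from beir.retrieval import models
from beir.retrieval.evaluation import EvaluateRetrieval
from beir.retrieval.search.dense import DenseRetrievalExactSearch as DRES

url = "https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/scifact.zip"
data_path = util.download_and_unzip(url, "datasets")
corpus, queries, qrels = GenericDataLoader(data_folder=data_path).load(split="test")

# Swap in whichever embedding model you are actually evaluating.
model = DRES(models.SentenceBERT("msmarco-distilbert-base-tas-b"), batch_size=64)
retriever = EvaluateRetrieval(model, score_function="dot")
results = retriever.retrieve(corpus, queries)

# nDCG@10 is the headline BEIR number; Recall@100 and MAP come out of the same call.
ndcg, _map, recall, precision = retriever.evaluate(qrels, results, retriever.k_values)
print(ndcg["NDCG@10"], recall["Recall@100"])
```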
MTEB (Massive Text Embedding Benchmark)
56+ datasets across 8 task types including retrieval, reranking, and clustering. The MTEB leaderboard is where embedding model rankings are settled in practice.
Metrics: nDCG@10 for retrieval tasks; task-specific elsewhere (MAP, accuracy, V-measure).
Links: paper, code, leaderboard.
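A minimal MTEB run looks like the sketch below, using the long-standing `MTEB(tasks=...)` entry point; newer releases also expose `mteb.get_tasks()`. The two task names are just small retrieval tasks chosen for a fast feedback loop.

```python
# Score one embedding model on two small MTEB retrieval tasks.
from mteb import MTEB
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")   # swap in the model under test
evaluation = MTEB(tasks=["SciFact", "NFCorpus"])  # small tasks, quick iteration
evaluation.run(model, output_folder="results/all-MiniLM-L6-v2")
```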
MS MARCO
Microsoft's question-answering and passage-ranking dataset built from Bing query logs. Roughly 1M queries and 8.8M passages. The training and evaluation workhorse for dense retrievers.
Metrics: MRR@10 (passage ranking), Recall@1000.
Links: paper, site.
Natural Questions (NQ)
Real Google search queries paired with Wikipedia answer spans. A standard for open-domain QA and dense retrieval evaluation. The NQ-Open subset is the most commonly cited variant in RAG papers.
Metrics: Exact Match and F1 for QA; Recall@20 and Recall@100 for retrieval.
Links: paper, dataset.
HotpotQA
Multi-hop QA over Wikipedia, where answering requires reasoning across two or more documents. Tests whether your retriever can chain evidence instead of just matching keywords.
Metrics: Exact Match and F1 on answers; Supporting Fact F1 on gold sentences.
Links: paper, site.
TREC Deep Learning Track
NIST's annual ranking benchmark built on top of MS MARCO, with deep, graded human relevance judgments instead of sparse binary labels. Results from multiple year editions are commonly reported side by side.
Metrics: nDCG@10 (primary), MAP, Recall@1000.
Links: TREC site, TREC-DL 2019 overview.
RAGAS
Not a fixed dataset but the most widely adopted open-source LLM-as-judge framework. If you want standard definitions of faithfulness and context relevance that your team and the broader community agree on, this is where to start.
Metrics: Faithfulness, Answer Relevance, Context Precision, Context Recall, Answer Correctness, Answer Semantic Similarity.
Links: paper, repo.
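A scoring run over your own traces looks roughly like the sketch below, assuming the ragas 0.1-style `evaluate()` API over a Hugging Face `Dataset`; column names and entry points have shifted between releases, so check the docs for your installed version. The example trace is made up.

```python
# Score a handful of RAG traces with ragas (0.1-style API; illustrative data).
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_relevancy, context_precision, context_recall, faithfulness

traces = Dataset.from_dict({
    "question":     ["What retention period applies to audit logs?"],
    "answer":       ["Audit logs are retained for seven years."],
    "contexts":     [["Section 4.2: audit logs must be retained for seven years."]],
    "ground_truth": ["Seven years, per section 4.2 of the retention policy."],
})

scores = evaluate(traces, metrics=[faithfulness, answer_relevancy,
                                   context_precision, context_recall])
print(scores)  # one aggregate score per metric
```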
CRAG (Comprehensive RAG Benchmark)
Meta's benchmark used in the KDD Cup 2024 RAG challenge. 4,409 QA pairs across five domains (finance, sports, music, movies, open) with mock APIs and web search results. Explicitly designed to stress dynamic, long-tail, and unanswerable questions.
Metrics: Truthfulness score (rewards correct answers, penalizes hallucinations, neutral on "I don't know"), accuracy, hallucination rate, missing rate.
Links: paper, KDD Cup challenge.
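The truthfulness scoring described above reduces to simple arithmetic once each answer has been graded: correct answers earn +1, hallucinated answers cost -1, and abstentions score 0, so refusing is always better than guessing wrong. A minimal sketch, with illustrative labels (CRAG itself assigns the grades with judge models and human review):

```python
# CRAG-style truthfulness score: +1 correct, 0 "I don't know", -1 hallucinated.
def truthfulness_score(grades: list[str]) -> float:
    points = {"correct": 1.0, "missing": 0.0, "hallucinated": -1.0}
    return sum(points[g] for g in grades) / len(grades)

grades = ["correct", "correct", "missing", "hallucinated"]
print(truthfulness_score(grades))  # (1 + 1 + 0 - 1) / 4 = 0.25
```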
MultiHop-RAG
A benchmark designed specifically for multi-hop queries against a news corpus, where evidence is spread across multiple documents. Useful for stress-testing query decomposition, agentic retrieval, and graph-style retrieval strategies.
Metrics: Hits@K, MAP@K, MRR@K on retrieval; generation accuracy end-to-end.
Links: paper, repo.
FRAMES
Google's factuality, retrieval, and reasoning benchmark: 824 multi-hop questions, each requiring information combined from 2 to 15 Wikipedia articles. Evaluates retrieval, reasoning, and factuality jointly in a single end-to-end test.
Metrics: accuracy under different retrieval settings (no retrieval, oracle, BM25, multi-step).
Links: paper, dataset.
RGB (Retrieval-Augmented Generation Benchmark)
Focused on robustness. Tests four specific failure modes that production RAG systems actually hit: noise robustness, negative rejection, information integration, and counterfactual robustness. Slots cleanly alongside CRAG.
Metrics: accuracy, rejection rate, error detection rate, error correction rate (one per failure mode).
Links: paper.
FinanceBench
10,231 question/answer pairs on real public company filings, built by Patronus AI. The standard sanity check if you are shipping RAG against SEC filings or similar financial documents.
Metrics: accuracy (human-rated correctness), refusal handling.
Links: paper.
How to choose
If you are tuning embeddings or rerankers, use BEIR and MTEB plus your own labeled set. End-to-end benchmarks add noise you do not need at that layer.
If you are tuning prompts, generation, or the full pipeline, run RAGAS on your own data and add CRAG or FRAMES to cover hallucination and unanswerable-question handling.
If you are in a regulated vertical, skip the general benchmarks for headline numbers and build a domain-specific labeled set. Use FinanceBench or similar vertical benchmarks as a sanity check, not as ground truth.
Public benchmarks will not replace your own eval set
Public benchmarks are necessary for comparing across systems and for catching catastrophic regressions. They are not sufficient because your corpus, your query distribution, and your answer requirements are unique. A system that scores well on BEIR but badly on your 200 internal queries is going to produce unhappy users, regardless of the leaderboard.
The methodology in our retrieval evaluation post is what plugs the gap between BEIR-style numbers and what your users actually experience.
We watched a vendor swap their retriever for a model that was 4 points higher on MTEB and 7 points worse on our internal legal eval. Public benchmarks catch the big failures. Your own eval set catches the ones that matter.
How TypeGraph benchmarks
At TypeGraph we run BEIR, MTEB, and RAGAS against every retrieval and ranking change, alongside per-domain labeled sets for legal, financial, medical, and code. We publish the diffs so customers can see exactly where a release moves the needle, not just that a single aggregate number went up.