
Evaluating RAG Retrieval Quality: Building an Automated Test Suite for Recall, Precision, and MRR

Ryan Musser
Founder

The vibes-based evaluation trap

Here's how most teams evaluate their RAG system: someone runs 5-10 queries manually, skims the results, and says "looks good." The chunking strategy gets changed. The embedding model gets swapped. The retrieval parameters get tuned. Nobody re-runs those queries. A month later, users start complaining that answers are worse, and nobody can pinpoint when the regression was introduced.

You wouldn't ship application code without tests. You shouldn't ship retrieval changes without measurable evaluation either. The good news is that retrieval evaluation is a well-studied problem with established metrics and straightforward implementation. The bad news is that almost nobody in the RAG space is actually doing it.

Building a labeled evaluation dataset

Every retrieval evaluation starts with a labeled dataset: a set of queries paired with their expected relevant documents (or passages). This is the part that feels tedious, but it's the foundation everything else depends on.

Start small - 50-100 query-passage pairs that represent your most important use cases. For each query, identify the specific chunks or documents that contain the correct answer. You can build this dataset manually (have domain experts annotate), semi-automatically (use your existing RAG system's outputs and have humans verify), or synthetically (use an LLM to generate questions from your documents and use the source document as the ground truth). The BEIR benchmark suite provides useful templates for how to structure evaluation datasets across different domains.
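Here's a minimal sketch of what this can look like on disk - a JSONL file where each line pairs a query with the IDs of its relevant chunks. The schema and file name are illustrative, not a standard:

```python
# eval_dataset.jsonl (illustrative schema), one labeled query per line:
#   {"query": "What is the refund window?", "relevant_ids": ["chunk-042", "chunk-117"]}
import json

def load_eval_dataset(path: str) -> list[dict]:
    """Load query-passage labels from a JSONL file."""
    with open(path) as f:
        return [json.loads(line) for line in f]
```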

The key constraint: your evaluation dataset must be representative of real user queries, not cherry-picked easy cases. Include the hard queries - the ones that require understanding context, the ones with ambiguous terminology, the ones that span multiple documents.

The metrics that matter: Recall@K, MRR, and nDCG

Recall@K answers: "Of all the relevant documents, what fraction appeared in my top K results?" If K=5 and there are 3 relevant documents, and your retriever found 2 of them in the top 5, your Recall@5 is 0.67. (Its counterpart, Precision@K - the fraction of the top K results that are relevant - matters less for RAG: a stray irrelevant chunk in the context is cheaper than a missing relevant one.) Recall is the single most important metric for RAG because if the relevant chunk doesn't make it into the context window, the LLM can't use it - no amount of prompt engineering fixes a retrieval miss.
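A minimal implementation, assuming your retriever returns a ranked list of chunk IDs and your labels are sets of relevant IDs:

```python
def recall_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int = 5) -> float:
    """Fraction of all relevant chunks that appear in the top-k results."""
    if not relevant_ids:
        return 0.0
    hits = len(set(retrieved_ids[:k]) & relevant_ids)
    return hits / len(relevant_ids)

# The example from the text: 3 relevant chunks, 2 of them retrieved in the top 5.
assert round(recall_at_k(["a", "x", "b", "y", "z"], {"a", "b", "c"}), 2) == 0.67
```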

Mean Reciprocal Rank (MRR) answers: "How high does the first relevant result appear?" Each query scores the reciprocal of the rank of its first relevant result - 1.0 if it's at position 1, 0.33 if it's at position 3 - and MRR averages that score across all queries. MRR matters because of the "lost in the middle" problem - LLMs pay more attention to information at the beginning and end of the context window. A relevant chunk at position 1 is more useful than one at position 5, even if both make it into the context.
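Under the same assumptions, the per-query reciprocal rank and its mean over the evaluation set:

```python
def reciprocal_rank(retrieved_ids: list[str], relevant_ids: set[str]) -> float:
    """1/rank of the first relevant result, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(per_query_rrs: list[float]) -> float:
    """Average the per-query reciprocal ranks (assumes a non-empty list)."""
    return sum(per_query_rrs) / len(per_query_rrs)
```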

Normalized Discounted Cumulative Gain (nDCG) is the most comprehensive metric - it considers both the relevance level of each result and its position. Unlike binary relevance (relevant or not), nDCG supports graded relevance (highly relevant, somewhat relevant, not relevant). This is useful when some documents contain a complete answer while others contain only partial information.
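One standard formulation (graded gains discounted by log2 of the position), sketched here with a plain dict of graded labels - how you store those labels is up to your dataset format:

```python
import math

def dcg(gains: list[float]) -> float:
    """Discounted cumulative gain: each gain is divided by log2(rank + 1)."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))

def ndcg_at_k(retrieved_ids: list[str], relevance: dict[str, float], k: int = 5) -> float:
    """nDCG@k with graded labels, e.g. {"chunk-1": 2.0, "chunk-9": 1.0}.
    The actual DCG is normalized by the DCG of the ideal ordering."""
    gains = [relevance.get(doc_id, 0.0) for doc_id in retrieved_ids[:k]]
    ideal_dcg = dcg(sorted(relevance.values(), reverse=True)[:k])
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0
```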

Detecting regressions: the retrieval CI/CD pipeline

Metrics are only useful if you compute them consistently and act on changes. Here's the practical setup:

Create a retrieval evaluation script that loads your labeled dataset, runs every query against your retrieval pipeline, computes Recall@K, MRR, and nDCG, and outputs a summary. Run this script in CI whenever you change anything that affects retrieval: chunking strategies, embedding models, retrieval parameters, or index configuration.
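Here's a sketch of that script, reusing the helpers from the earlier snippets; `retrieve` is a stand-in for whatever function your retrieval pipeline exposes:

```python
def evaluate(dataset_path: str, retrieve, k: int = 5) -> dict:
    """Run every labeled query through the retriever and average the metrics."""
    records = load_eval_dataset(dataset_path)
    recalls, rrs, ndcgs = [], [], []
    for rec in records:
        retrieved = retrieve(rec["query"], top_k=k)  # ranked list of chunk IDs
        relevant = set(rec["relevant_ids"])
        recalls.append(recall_at_k(retrieved, relevant, k))
        rrs.append(reciprocal_rank(retrieved, relevant))
        # Binary gains here; swap in graded labels if your dataset has them.
        ndcgs.append(ndcg_at_k(retrieved, {d: 1.0 for d in relevant}, k))
    n = len(records)
    return {f"recall@{k}": sum(recalls) / n,
            "mrr": sum(rrs) / n,
            f"ndcg@{k}": sum(ndcgs) / n}
```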

Set threshold alerts: if Recall@5 drops by more than 2% compared to the baseline, the CI check fails. This catches the silent regressions that vibes-based evaluation misses. Store historical metric values so you can plot trends over time - gradual degradation is harder to spot than sudden drops but equally damaging.
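And a sketch of the gate itself, comparing the summary above against a committed baseline file (the file name is an assumption); the 2% threshold is read here as an absolute two-point drop, e.g. 0.82 to 0.79:

```python
import json
import sys

def check_regression(current: dict, baseline_path: str, max_drop: float = 0.02) -> None:
    """Exit non-zero (failing the CI job) if any tracked metric fell
    more than max_drop below the stored baseline."""
    with open(baseline_path) as f:
        baseline = json.load(f)
    failures = [
        f"{name}: {baseline[name]:.3f} -> {value:.3f}"
        for name, value in current.items()
        if name in baseline and baseline[name] - value > max_drop
    ]
    if failures:
        print("Retrieval regression detected:\n" + "\n".join(failures))
        sys.exit(1)
```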

Stratified evaluation: not all queries are equal

Aggregate metrics hide important patterns. A chunking change might improve retrieval for short factual queries while devastating performance on complex analytical queries. The aggregate Recall@5 might stay flat, masking a severe regression on your most important query type.

Stratify your evaluation dataset by query type (factual, analytical, multi-hop), document type (technical docs, support articles, legal contracts), and query complexity (single entity, multi-entity, temporal). Report metrics per stratum, not just overall. This is where most teams stop too early - they see good aggregate numbers and miss category-specific regressions.
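A sketch of per-stratum reporting, assuming each dataset record carries a `query_type` label such as "factual" or "multi-hop"; the same grouping works for document type or complexity:

```python
from collections import defaultdict

def recall_by_stratum(records: list[dict], retrieve, k: int = 5) -> dict[str, float]:
    """Compute Recall@k separately for each query_type group."""
    scores = defaultdict(list)
    for rec in records:
        retrieved = retrieve(rec["query"], top_k=k)
        score = recall_at_k(retrieved, set(rec["relevant_ids"]), k)
        scores[rec.get("query_type", "unlabeled")].append(score)
    return {qtype: sum(s) / len(s) for qtype, s in scores.items()}
```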

The cost of not measuring

We've seen teams lose weeks of engineering time debugging "the RAG is worse" complaints that trace back to a chunking parameter change made months ago. With an automated evaluation pipeline, that regression would have been caught in the PR that introduced it - a 5-minute CI check versus weeks of forensic debugging.

We integrated retrieval evaluation into our CI pipeline and caught a 12% Recall@5 regression on the same day it was introduced - a chunking overlap parameter had been accidentally removed. Without automated evaluation, that would have been a customer-reported issue weeks later.

Getting started with retrieval evaluation

Start with 50 labeled query-passage pairs and Recall@5. That's it. You can add MRR, nDCG, stratification, and CI integration incrementally. The first 50 labeled pairs will teach you more about your retrieval quality than months of manual testing. At TypeGraph, we publish benchmark results across legal, financial, medical, and code domains using exactly this methodology - and we make it easy to run the same evaluations on your own data with our evaluation toolkit.
