Legal RAG Bench Retrieval Benchmark
TypeGraph scored 0.6556 nDCG@10 and 0.9000 Recall@10 on Legal RAG Bench after processing all 4,876 source passages and indexing 4,658 unique-content documents.
This is a TypeGraph Cloud documents-only semantic retrieval run on the public Legal RAG Bench dataset. The run uses BEIR-style retrieval metrics at cutoff 10 and excludes graph extraction, BM25, and recency scoring.
TypeGraph processed all 4,876 source passages. Its content deduplication collapsed 218 exact duplicate passages into unique indexed records, leaving 4,658 document chunks while preserving all gold passages used by the 100 evaluation questions.
TypeGraph Cloud completed this bucket-scoped semantic retrieval eval in 13.26 seconds according to TypeGraph telemetry. The top official Legal RAG Bench row shown, Kanon 2 Embedder, reports 258.42 seconds on the same dataset; treat this as leaderboard-reported runtime context rather than a controlled hardware benchmark.
Executive Summary
Legal RAG Bench tests retrieval over legal instruction passages where queries ask for the specific passage that supports an answer. The corpus includes repeated boilerplate and update-note passages, which makes exact content deduplication useful for reducing redundant embeddings and duplicate retrieval noise.
The TypeGraph run used documents-only semantic retrieval with 1024-dimensional Voyage 4 Large embeddings stored as pgvector halfvec. It scored 0.6556 nDCG@10, 0.5772 MAP@10, and 0.9000 Recall@10 across the 100 scored queries.
The deduplication behavior is part of the benchmark result: all 4,876 source passages were processed, 218 exact duplicate-content passages were collapsed, and all gold passages for the evaluation questions remained represented in the indexed corpus.
Benchmark Dataset
Each query asks a legal question with exactly one relevant passage ID; the retriever must surface the source passage that supports the answer.
| Property | Value |
|---|---|
| Dataset | Legal RAG Bench |
| Category | Legal RAG |
| Corpus | 4,876 source passages; 4,658 unique indexed documents |
| Indexed chunks | 4,658 chunks |
| Queries | 100 queries |
| Qrels | 1 relevant passage per query |
| Chunking | 2048 tokens, 256 overlap |
| Ingest time | 125.52s (indexing telemetry) |
| Eval time | 13.26s (bucket retrieval) |
| Ingest cost | $0.2043 |
| Eval cost | $0.0036 |
| Total cost | $0.2080 |
TypeGraph processed the full 4,876-passage corpus. The indexed document count is lower because content deduplication collapsed 218 exact duplicate passages; all gold passages used by the 100 eval questions remained represented.
Methodology
- Loaded the Isaacus Legal RAG Bench corpus and QA relevance pairs from the public dataset.
- Seeded a TypeGraph Cloud benchmark bucket with graph extraction disabled and chunking set to 2048 tokens with 256 overlap.
- Used bucket-level content deduplication, which processed all 4,876 source passages and indexed 4,658 unique-content document chunks.
- Ran documents-only semantic retrieval. BM25, graph, and recency weights were disabled for this run.
- Requested 12 retrieval candidates per query and scored the deduplicated top 10 against the source relevance pairs (see the scoring sketch after this list).
- Calculated cost from TypeGraph metered events: ingest embedding tokens, search embedding tokens, and compute duration at the public metered rates.
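The sketch below shows how BEIR-style cutoff-10 metrics can be computed when each query has exactly one gold passage, as in Legal RAG Bench. The `score_run` function name and data shapes are illustrative, not the benchmark runner's actual code; with a single relevant passage, AP reduces to the reciprocal rank of the hit and the ideal DCG is 1.

```python
import math

def score_run(results: dict[str, list[str]], qrels: dict[str, str], k: int = 10) -> dict[str, float]:
    """Compute nDCG@k, MAP@k, Recall@k, and Precision@k for a
    single-relevant-passage benchmark.

    results: query_id -> ranked list of retrieved passage IDs
    qrels:   query_id -> the one gold passage ID
    """
    ndcg = ap = recall = precision = 0.0
    for qid, gold in qrels.items():
        ranked = results.get(qid, [])[:k]
        if gold in ranked:
            rank = ranked.index(gold) + 1      # 1-based rank of the hit
            ndcg += 1.0 / math.log2(rank + 1)  # ideal DCG is 1 with one gold passage
            ap += 1.0 / rank                   # AP reduces to reciprocal rank here
            recall += 1.0                      # the single gold passage was found
            precision += 1.0 / k               # one hit out of k returned slots
    n = len(qrels)
    return {
        f"nDCG@{k}": ndcg / n,
        f"MAP@{k}": ap / n,
        f"Recall@{k}": recall / n,
        f"P@{k}": precision / n,
    }
```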
Detailed Metrics Overview
| Metric | TypeGraph Score | How to read it |
|---|---|---|
| nDCG@10 | 0.655558 | Primary ranking-quality metric for this retrieval run. |
| MAP@10 | 0.577230 | Mean average precision across all scored queries. |
| Recall@10 | 0.900000 | 90 of 100 relevant Legal RAG passages appeared in the top 10. |
| Precision@10 | 0.090000 | Near the dataset cap of 0.10 because each query has one relevant passage. |
| Queries run | 100 | All Legal RAG questions were scored. |
| Weights | Semantic only | semantic=1; BM25, graph, and recency disabled. |
How to read Precision@10 here
Precision@10 divides the number of hits by the 10 returned slots. Legal RAG Bench has one gold passage per query, so a perfect retrieval run scores 0.10 rather than 1.00 on P@10.
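As a quick check of the arithmetic, with one gold passage per query the per-query ceiling and this run's observed average work out as follows:

```latex
P@10 \;=\; \frac{\text{relevant passages in the top 10}}{10} \;\le\; \frac{1}{10} = 0.10
\qquad
\text{observed } P@10 \;=\; \frac{90 \text{ hits}}{100 \text{ queries} \times 10 \text{ slots}} \;=\; 0.090
```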
Legal RAG Bench Leaderboard Comparison
Official rows come from Isaacus MLEB results.jsonl. The highlighted TypeGraph row is inserted for direct comparison using this page's documents-only semantic run.
| Rank | Model | Provider | Dims | nDCG@10 | Eval time |
|---|---|---|---|---|---|
| 1 | Kanon 2 Embedder | Isaacus | 1792 | 0.67950 | 258.42s |
| 2 | TypeGraph Cloud | TypeGraph | 1024 | 0.65556 | 13.26s |
| 3 | Voyage 4 Large | Voyage | 1024 | 0.64804 | 303.57s |
| 4 | Voyage 3.5 | Voyage | 1024 | 0.60309 | 495.30s |
| 5 | Voyage 3 Large | Voyage | 1024 | 0.60102 | 510.13s |
| 6 | Voyage 4 | Voyage | 1024 | 0.56626 | 1231.28s |
| 7 | Voyage 4 Lite | Voyage | 1024 | 0.51204 | 1247.21s |
| 8 | Qwen3 Embedding 8B | Qwen | 4096 | 0.49947 | 204.35s |
| 9 | Qwen3 Embedding 4B | Qwen | 2560 | 0.45435 | 129.95s |
| 10 | Voyage Law 2 | Voyage | 1024 | 0.44890 | 1375.22s |
| 11 | Jina Embeddings v5 Text Small | Jina | 1024 | 0.42274 | 28.31s |
| 12 | Gemini Embedding 001 | Google | 3072 | 0.42196 | 399.35s |
| 13 | Snowflake Arctic Embed L v2.0 | Snowflake | - | 0.40161 | 26.74s |
| 14 | Text Embedding 3 Large | OpenAI | 3072 | 0.39838 | 102.47s |
Metered Cost
| Meter | Usage | Rate | Cost |
|---|---|---|---|
| Ingest embeddings | 1,448,728 tokens | $0.12 / M tokens | $0.1738 |
| Ingest compute | 211.163s | $0.52 / CPU-hour | $0.0305 |
| Search embeddings | 5,899 tokens | $0.04 / M tokens | $0.0002 |
| Query compute | 23.631s | $0.52 / CPU-hour | $0.0034 |
Storage is excluded. Costs use public TypeGraph metered rates: ingest embeddings at $0.12/M tokens, search embeddings at $0.04/M tokens, and compute at $0.52/CPU-hour.
Relevant Code
Create a bucket and ingest Legal RAG Bench
The benchmark bucket uses 2048-token chunks and content deduplication, so exact duplicate passages appear only once in the retrieval index.
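A minimal sketch of this step. The `TypeGraphClient` class, the `create_bucket` and `ingest_document` methods, and the corpus loader are hypothetical stand-ins rather than the documented TypeGraph Cloud SDK; only the configuration values (2048-token chunks, 256-token overlap, graph extraction off, content deduplication on) come from this run.

```python
# Sketch only: the client, method names, and loader below are hypothetical
# stand-ins for the TypeGraph Cloud SDK. Configuration values mirror this run.
from typegraph import TypeGraphClient  # hypothetical import

client = TypeGraphClient(api_key="...")

bucket = client.create_bucket(
    name="legal-rag-bench",
    chunk_tokens=2048,           # 2048-token chunks
    chunk_overlap=256,           # 256-token overlap
    graph_extraction=False,      # graph extraction disabled for this run
    content_deduplication=True,  # exact duplicate passages indexed once
)

# Ingest all 4,876 Legal RAG Bench passages; dedup collapses 218 exact duplicates.
for passage in load_legal_rag_bench_corpus():  # hypothetical loader
    bucket.ingest_document(external_id=passage["id"], text=passage["text"])
```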
Run documents-only semantic retrieval
The eval asks for 12 candidates and scores the deduplicated top 10 against the Legal RAG relevance IDs.
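A matching sketch of the retrieval step, again with hypothetical method and field names; the semantic-only weights, the 12-candidate request, and the top-10 cutoff come from this run.

```python
# Sketch only: search arguments, hit fields, and loaders are hypothetical stand-ins.
qrels = load_legal_rag_bench_qrels()          # hypothetical loader: query_id -> gold passage ID
results = {}
for query in load_legal_rag_bench_queries():  # hypothetical loader: 100 queries
    hits = bucket.search(
        query=query["text"],
        limit=12,                              # request 12 candidates per query
        weights={"semantic": 1.0, "bm25": 0.0, # documents-only semantic retrieval
                 "graph": 0.0, "recency": 0.0},
    )
    # Keep the deduplicated top 10 for scoring against the gold passage ID.
    results[query["id"]] = [hit.external_id for hit in hits][:10]

metrics = score_run(results, qrels, k=10)      # scoring sketch from the Methodology section
```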
Metered cost formula
The page reports metered TypeGraph usage from telemetry rather than benchmark runner wall time.
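A small sketch of the cost arithmetic using the public metered rates quoted above; the usage figures are the telemetry numbers from the Metered Cost table, and the function name is illustrative.

```python
# Cost arithmetic at the public metered rates.
INGEST_EMBED_RATE = 0.12 / 1_000_000   # $ per ingest embedding token
SEARCH_EMBED_RATE = 0.04 / 1_000_000   # $ per search embedding token
COMPUTE_RATE      = 0.52 / 3600        # $ per CPU-second ($0.52 / CPU-hour)

def metered_cost(ingest_tokens, search_tokens, ingest_cpu_s, query_cpu_s):
    return (ingest_tokens * INGEST_EMBED_RATE
            + search_tokens * SEARCH_EMBED_RATE
            + (ingest_cpu_s + query_cpu_s) * COMPUTE_RATE)

# This run: 1,448,728 ingest tokens, 5,899 search tokens,
# 211.163s ingest compute, 23.631s query compute.
print(f"${metered_cost(1_448_728, 5_899, 211.163, 23.631):.4f}")  # ~ $0.2080
```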
References
Related TypeGraph Reading
- Set up semantic document retrieval in TypeGraph.
- Understand weights, scoring, and top-k retrieval.
- Ingest and manage documents in buckets.
- Search documents, events, threads, entities, and facts with semantic, BM25, graph, and recency weights.
- How to read recall, precision, nDCG, MAP, and MRR.
- Compare against an MLEB contracts retrieval task.
- Compare against another legal retrieval task.