Legal RAG Bench Retrieval Benchmark

Published May 13, 2026 · Updated May 13, 2026

TypeGraph scored 0.6556 nDCG@10 and 0.9000 Recall@10 on Legal RAG Bench after processing all 4,876 source passages and indexing 4,658 unique-content documents.

What this page shows

This is a TypeGraph Cloud documents-only semantic retrieval run on the public Legal RAG Bench dataset. The run uses BEIR-style retrieval metrics at cutoff 10 and excludes graph extraction, BM25, and recency scoring.

TypeGraph processed all 4,876 source passages. Its content deduplication collapsed 218 exact-duplicate passages, leaving 4,658 unique indexed document chunks while preserving every gold passage used by the 100 evaluation questions.
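TypeGraph applies this deduplication server-side during ingest. As a rough illustration of the idea (the SHA-256 content hash here is an assumption for the sketch, not TypeGraph's documented internals):

dedupe-sketch.ts
import { createHash } from 'node:crypto'

type Passage = { id: string; text: string }

// Collapse passages whose exact content collides, keeping one indexed
// record per unique body. On this corpus: 4,876 source passages in,
// 4,658 unique records out.
function dedupeByContent(passages: Passage[]): Passage[] {
  const seen = new Set<string>()
  const unique: Passage[] = []
  for (const passage of passages) {
    const hash = createHash('sha256').update(passage.text).digest('hex')
    if (!seen.has(hash)) {
      seen.add(hash)
      unique.push(passage)
    }
  }
  return unique
}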

Semantic eval duration
13.26s
19.5x faster than the top official Legal RAG Bench leaderboard row shown

TypeGraph Cloud completed this bucket-scoped semantic retrieval eval in 13.26 seconds according to TypeGraph telemetry. The top official Legal RAG Bench row shown, Kanon 2 Embedder, reports 258.42 seconds on the same dataset; treat this as leaderboard-reported runtime context rather than a controlled hardware benchmark.

nDCG@10
0.6556
Ranking quality across all 100 scored Legal RAG queries
Recall@10
90.00%
90 of 100 gold passages were retrieved in the scored top 10
Eval Time
13.26s
Bucket-scoped semantic retrieval telemetry
Ingest Time
125.52s
Bucket-scoped indexing telemetry
Deduped
218 passages
Exact duplicate-content source passages collapsed before indexing
Cost
$0.2080
Metered ingest plus query execution cost

Executive Summary

Legal RAG Bench tests retrieval over legal instruction passages where queries ask for the specific passage that supports an answer. The corpus includes repeated boilerplate and update-note passages, which makes exact content deduplication useful for reducing redundant embeddings and duplicate retrieval noise.

The TypeGraph run used documents-only semantic retrieval with 1024-dimensional Voyage 4 Large embeddings stored as pgvector halfvec. It scored 0.6556 nDCG@10, 0.5772 MAP@10, and 0.9000 Recall@10 across the 100 scored queries.

The deduplication behavior is part of the benchmark result: all 4,876 source passages were processed, 218 exact duplicate-content passages were collapsed, and all gold passages for the evaluation questions remained represented in the indexed corpus.

Benchmark Dataset

Queries ask legal questions with one relevant passage ID. The retriever must surface the source passage that supports the answer.

Property         Value
Dataset          Legal RAG Bench
Category         Legal RAG
Corpus           4,876 source passages; 4,658 unique indexed documents
Indexed chunks   4,658 chunks
Queries          100 queries
Qrels            1 relevant passage per query
Chunking         2048 tokens, 256 overlap
Ingest time      125.52s (indexing telemetry)
Eval time        13.26s (bucket retrieval)
Ingest cost      $0.2043
Eval cost        $0.0036
Total cost       $0.2080

TypeGraph processed the full 4,876-passage corpus. The indexed document count is lower because content deduplication collapsed 218 exact duplicate passages; all gold passages used by the 100 eval questions remained represented.

Methodology

  1. Loaded the Isaacus Legal RAG Bench corpus and QA relevance pairs from the public dataset.
  2. Seeded a TypeGraph Cloud benchmark bucket with graph extraction disabled and chunking set to 2048 tokens with 256 overlap.
  3. Used bucket-level content deduplication, which processed all 4,876 source passages and indexed 4,658 unique-content document chunks.
  4. Ran documents-only semantic retrieval. BM25, graph, and recency weights were disabled for this run.
  5. Requested 12 retrieval candidates per query and scored the deduplicated top 10 against the source relevance pairs (see the scoring sketch after this list).
  6. Calculated cost from TypeGraph metered events: ingest embedding tokens, search embedding tokens, and compute duration at the public metered rates.
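Because each query has exactly one gold passage, every per-query metric in this run reduces to a function of the gold passage's rank. A minimal scoring sketch; this helper is illustrative, not the benchmark runner's code:

score-query.ts
// With a single relevant document, ideal DCG is 1 (gold at rank 1), so
// nDCG@k = 1/log2(rank + 1), AP@k = 1/rank, and recall is a 0/1 hit flag.
function scoreSingleGoldQuery(rank: number | null, k = 10) {
  if (rank === null || rank > k) return { ndcg: 0, ap: 0, recall: 0 }
  return {
    ndcg: 1 / Math.log2(rank + 1),
    ap: 1 / rank,
    recall: 1,
  }
}

Averaging these per-query values over the 100 queries yields the reported nDCG@10, MAP@10, and Recall@10.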

Detailed Metrics Overview

Before we dive into the leaderboard, here's a quick overview of the metrics, TypeGraph Cloud's scores, and how to read them:
Metric         TypeGraph Score   How to read it
nDCG@10        0.655558          Primary ranking-quality metric for this retrieval run.
MAP@10         0.577230          Mean average precision across all scored queries.
Recall@10      0.900000          90 of 100 relevant Legal RAG passages appeared in the top 10.
Precision@10   0.090000          Near the dataset cap of 0.10 because each query has one relevant passage.
Queries run    100               All Legal RAG questions were scored.
Weights        Semantic only     semantic=1; BM25, graph, and recency disabled.

How to read Precision@10 here

Precision@10 divides the number of hits by the 10 returned slots. Legal RAG Bench has one gold passage per query, so even a perfect retrieval run scores 0.10 rather than 1.00 on P@10.
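In concrete numbers (a quick arithmetic check on the reported figures, not additional telemetry):

// One gold passage per query caps Precision@10 at 1/10.
const perfectP10 = 1 / 10            // 0.10
const observedP10 = 90 / (100 * 10)  // 0.090: 90 hits across 1,000 returned slots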

Legal RAG Bench Leaderboard Comparison

Official rows come from Isaacus MLEB results.jsonl. The highlighted TypeGraph row is inserted for direct comparison using this page's documents-only semantic run.

Rank   Model                           Provider    Dims   nDCG@10   Eval time
1      Kanon 2 Embedder                Isaacus     1792   0.67950   258.42s
2      TypeGraph Cloud                 TypeGraph   1024   0.65556   13.26s
3      Voyage 4 Large                  Voyage      1024   0.64804   303.57s
4      Voyage 3.5                      Voyage      1024   0.60309   495.30s
5      Voyage 3 Large                  Voyage      1024   0.60102   510.13s
6      Voyage 4                        Voyage      1024   0.56626   1231.28s
7      Voyage 4 Lite                   Voyage      1024   0.51204   1247.21s
8      Qwen3 Embedding 8B              Qwen        4096   0.49947   204.35s
9      Qwen3 Embedding 4B              Qwen        2560   0.45435   129.95s
10     Voyage Law 2                    Voyage      1024   0.44890   1375.22s
11     Jina Embeddings v5 Text Small   Jina        1024   0.42274   28.31s
12     Gemini Embedding 001            Google      3072   0.42196   399.35s
13     Snowflake Arctic Embed L v2.0   Snowflake   -      0.40161   26.74s
14     Text Embedding 3 Large          OpenAI      3072   0.39838   102.47s

Metered Cost

Ingest
$0.2043
Query
$0.0036
Total
$0.2080
Meter               Usage              Rate               Cost
Ingest embeddings   1,448,728 tokens   $0.12 / M tokens   $0.1738
Ingest compute      211.163s           $0.52 / CPU-hour   $0.0305
Search embeddings   5,899 tokens       $0.04 / M tokens   $0.0002
Query compute       23.631s            $0.52 / CPU-hour   $0.0034

Storage is excluded. Costs use public TypeGraph metered rates: ingest embeddings at $0.12/M tokens, search embeddings at $0.04/M tokens, and compute at $0.52/CPU-hour.

Relevant Code

Create a bucket and ingest Legal RAG Bench

The benchmark bucket uses 2048-token chunks and content deduplication, so each exact-duplicate passage is indexed only once in the retrieval index.

seed-legal-rag.ts
import { readFile } from 'node:fs/promises'
import { typegraphInit } from '@typegraph-ai/sdk'

type CorpusRow = {
  id: string
  title?: string
  text: string
}

const typegraph = await typegraphInit({
  apiKey: process.env.TYPEGRAPH_API_KEY!,
  tenantId: process.env.TYPEGRAPH_TENANT_ID!,
})

const corpus = JSON.parse(
  await readFile('./data/legal-rag-bench/corpus.json', 'utf8'),
) as CorpusRow[]

const bucket = await typegraph.bucket.create({
  name: 'legal-rag-bench',
  indexDefaults: {
    chunkSize: 2048,
    chunkOverlap: 256,
    graphExtraction: false,
    deduplicateBy: ['content'],
  },
})

await typegraph.document.ingest(
  corpus.map((row) => ({
    id: row.id,
    name: row.title ?? row.id,
    content: row.text,
    metadata: { corpusId: row.id },
  })),
  { bucketId: bucket.id },
)

Run documents-only semantic retrieval

The eval asks for 12 candidates and scores the deduplicated top 10 against the Legal RAG relevance IDs.

eval-legal-rag.ts
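// `typegraph` is the client created with typegraphInit() in seed-legal-rag.ts,
// and `query` is one question/relevance row from the benchmark dataset.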
const response = await typegraph.search(query.question, {
  buckets: [process.env.TYPEGRAPH_BUCKET_ID!],
  resources: ['documents'],
  weights: { semantic: 1, bm25: false, graph: false, recency: false },
  limit: 12,
})

const retrievedCorpusIds = response.results.chunks.map((chunk) =>
  String(chunk.metadata?.corpusId ?? chunk.document.name),
)
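From here, scoring one query is a matter of finding the gold passage's rank in the deduplicated top 10. A sketch using the illustrative scoreSingleGoldQuery helper from the methodology section; query.relevantId is an assumed field name for the gold passage ID:

// Dedupe the retrieved IDs, keep the top 10, and locate the gold passage.
const top10 = [...new Set(retrievedCorpusIds)].slice(0, 10)
const hitIndex = top10.indexOf(query.relevantId)
const rank = hitIndex === -1 ? null : hitIndex + 1 // 1-indexed, null on a miss
const { ndcg, ap, recall } = scoreSingleGoldQuery(rank)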

Metered cost formula

The page reports metered TypeGraph usage from telemetry rather than benchmark runner wall time.

cost.ts
const ingestCost =
  (embedIngestTokens / 1_000_000) * 0.12 +
  (ingestComputeMs / 3_600_000) * 0.52

const queryCost =
  (embedQueryTokens / 1_000_000) * 0.04 +
  (queryComputeMs / 3_600_000) * 0.52
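Plugging the usage from the metered cost table into these formulas reproduces the reported figures:

// Usage values from the table above (compute durations in milliseconds).
const ingestCheck = (1_448_728 / 1_000_000) * 0.12 + (211_163 / 3_600_000) * 0.52
// ≈ 0.1738 + 0.0305 = $0.2043
const queryCheck = (5_899 / 1_000_000) * 0.04 + (23_631 / 3_600_000) * 0.52
// ≈ 0.0002 + 0.0034 = $0.0036
const totalCheck = ingestCheck + queryCheck // ≈ $0.2080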
