Legal RAG Bench Retrieval Benchmark

Published May 13, 2026 · Updated May 13, 2026

TypeGraph scored 0.6556 nDCG@10 and 0.9000 Recall@10 on Legal RAG Bench after processing all 4,876 source passages and indexing 4,658 unique-content documents.

What this page shows

This is a TypeGraph Cloud documents-only semantic retrieval run on the public Legal RAG Bench dataset. The run uses BEIR-style retrieval metrics at cutoff 10 and excludes graph extraction, BM25, and recency scoring.

TypeGraph processed all 4,876 source passages. Its content deduplication collapsed 218 exact-duplicate passages, leaving 4,658 unique indexed document chunks while preserving every gold passage used by the 100 evaluation questions.
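TypeGraph applies this deduplication server-side during ingest. As a rough illustration of the idea (the SHA-256 content hash here is an assumption for the sketch, not TypeGraph's documented internals):

dedupe-sketch.ts
import { createHash } from 'node:crypto'

type Passage = { id: string; text: string }

// Collapse passages whose exact content collides, keeping one indexed
// record per unique body. On this corpus: 4,876 source passages in,
// 4,658 unique records out.
function dedupeByContent(passages: Passage[]): Passage[] {
  const seen = new Set<string>()
  const unique: Passage[] = []
  for (const passage of passages) {
    const hash = createHash('sha256').update(passage.text).digest('hex')
    if (!seen.has(hash)) {
      seen.add(hash)
      unique.push(passage)
    }
  }
  return unique
}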

Semantic eval duration
13.26s
19.5x faster than the top official Legal RAG Bench leaderboard row shown

TypeGraph Cloud completed this bucket-scoped semantic retrieval eval in 13.26 seconds according to TypeGraph telemetry. The top official Legal RAG Bench row shown, Kanon 2 Embedder, reports 258.42 seconds on the same dataset; treat this as leaderboard-reported runtime context rather than a controlled hardware benchmark.

nDCG@10
0.6556
Ranking quality across all 100 scored Legal RAG queries
Recall@10
90.00%
90 of 100 gold passages were retrieved in the scored top 10
Eval Time
13.26s
Bucket-scoped semantic retrieval telemetry
Ingest Time
125.52s
Bucket-scoped indexing telemetry
Deduped
218 passages
Exact duplicate-content source passages collapsed before indexing
Cost
$0.2080
Metered ingest plus query execution cost

Executive Summary

Legal RAG Bench tests retrieval over legal instruction passages where queries ask for the specific passage that supports an answer. The corpus includes repeated boilerplate and update-note passages, which makes exact content deduplication useful for reducing redundant embeddings and duplicate retrieval noise.

The TypeGraph run used documents-only semantic retrieval with 1024-dimensional Voyage 4 Large embeddings stored as pgvector halfvec. It scored 0.6556 nDCG@10, 0.5772 MAP@10, and 0.9000 Recall@10 across the 100 scored queries.

The deduplication behavior is part of the benchmark result: all 4,876 source passages were processed, 218 exact duplicate-content passages were collapsed, and all gold passages for the evaluation questions remained represented in the indexed corpus.

Benchmark Dataset

Queries ask legal questions with one relevant passage ID. The retriever must surface the source passage that supports the answer.

Property         Value
Dataset          Legal RAG Bench
Category         Legal RAG
Corpus           4,876 source passages; 4,658 unique indexed documents
Indexed chunks   4,658 chunks
Queries          100 queries
Qrels            1 relevant passage per query
Chunking         2048 tokens, 256 overlap
Ingest time      125.52s (indexing telemetry)
Eval time        13.26s (bucket retrieval)
Ingest cost      $0.2043
Eval cost        $0.0036
Total cost       $0.2080

TypeGraph processed the full 4,876-passage corpus. The indexed document count is lower because content deduplication collapsed 218 exact duplicate passages; all gold passages used by the 100 eval questions remained represented.

Methodology

  1. Loaded the Isaacus Legal RAG Bench corpus and QA relevance pairs from the public dataset.
  2. Seeded a TypeGraph Cloud benchmark bucket with graph extraction disabled and chunking set to 2048 tokens with 256 overlap.
  3. Used bucket-level content deduplication, which processed all 4,876 source passages and indexed 4,658 unique-content document chunks.
  4. Ran documents-only semantic retrieval. BM25, graph, and recency weights were disabled for this run.
  5. Requested 12 retrieval candidates per query and scored the deduplicated top 10 against the source relevance pairs (see the scoring sketch after this list).
  6. Calculated cost from TypeGraph metered events: ingest embedding tokens, search embedding tokens, and compute duration at the public metered rates.
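Because each query has exactly one gold passage, every per-query metric in this run reduces to a function of the gold passage's rank. A minimal scoring sketch; this helper is illustrative, not the benchmark runner's code:

score-query.ts
// With a single relevant document, ideal DCG is 1 (gold at rank 1), so
// nDCG@k = 1/log2(rank + 1), AP@k = 1/rank, and recall is a 0/1 hit flag.
function scoreSingleGoldQuery(rank: number | null, k = 10) {
  if (rank === null || rank > k) return { ndcg: 0, ap: 0, recall: 0 }
  return {
    ndcg: 1 / Math.log2(rank + 1),
    ap: 1 / rank,
    recall: 1,
  }
}

Averaging these per-query values over the 100 queries yields the reported nDCG@10, MAP@10, and Recall@10.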

Detailed Metrics Overview

Before we dive into the leaderboard, here's a quick overview of the metrics, TypeGraph Cloud's scores, and how to read them:
Metric         TypeGraph Score   How to read it
nDCG@10        0.655558          Primary ranking-quality metric for this retrieval run.
MAP@10         0.577230          Mean average precision across all scored queries.
Recall@10      0.900000          90 of 100 relevant Legal RAG passages appeared in the top 10.
Precision@10   0.090000          Near the dataset cap of 0.10 because each query has one relevant passage.
Queries run    100               All Legal RAG questions were scored.
Weights        Semantic only     semantic=1; BM25, graph, and recency disabled.

How to read Precision@10 here

Precision@10 divides the number of hits by the 10 returned slots. Legal RAG Bench has one gold passage per query, so even a perfect retrieval run scores 0.10 rather than 1.00 on P@10.
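In concrete numbers (a quick arithmetic check on the reported figures, not additional telemetry):

// One gold passage per query caps Precision@10 at 1/10.
const perfectP10 = 1 / 10            // 0.10
const observedP10 = 90 / (100 * 10)  // 0.090: 90 hits across 1,000 returned slots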

Legal RAG Bench Leaderboard Comparison

Official rows come from Isaacus MLEB results.jsonl. The highlighted TypeGraph row is inserted for direct comparison using this page's documents-only semantic run.

Rank   Model                           Provider    Dims   nDCG@10   Eval time
1      Kanon 2 Embedder                Isaacus     1792   0.67950   258.42s
2      TypeGraph Cloud                 TypeGraph   1024   0.65556   13.26s
3      Voyage 4 Large                  Voyage      1024   0.64804   303.57s
4      Voyage 3.5                      Voyage      1024   0.60309   495.30s
5      Voyage 3 Large                  Voyage      1024   0.60102   510.13s
6      Voyage 4                        Voyage      1024   0.56626   1231.28s
7      Voyage 4 Lite                   Voyage      1024   0.51204   1247.21s
8      Qwen3 Embedding 8B              Qwen        4096   0.49947   204.35s
9      Qwen3 Embedding 4B              Qwen        2560   0.45435   129.95s
10     Voyage Law 2                    Voyage      1024   0.44890   1375.22s
11     Jina Embeddings v5 Text Small   Jina        1024   0.42274   28.31s
12     Gemini Embedding 001            Google      3072   0.42196   399.35s
13     Snowflake Arctic Embed L v2.0   Snowflake   -      0.40161   26.74s
14     Text Embedding 3 Large          OpenAI      3072   0.39838   102.47s

Metered Cost

Ingest
$0.2043
Query
$0.0036
Total
$0.2080
Meter               Usage              Rate               Cost
Ingest embeddings   1,448,728 tokens   $0.12 / M tokens   $0.1738
Ingest compute      211.163s           $0.52 / CPU-hour   $0.0305
Search embeddings   5,899 tokens       $0.04 / M tokens   $0.0002
Query compute       23.631s            $0.52 / CPU-hour   $0.0034

Storage is excluded. Costs use public TypeGraph metered rates: ingest embeddings at $0.12/M tokens, search embeddings at $0.04/M tokens, and compute at $0.52/CPU-hour.

Relevant Code

Create a bucket and ingest Legal RAG Bench

The benchmark bucket uses 2048-token chunks and content deduplication, so each exact-duplicate passage is indexed only once in the retrieval index.

seed-legal-rag.ts
import { readFile } from 'node:fs/promises'
import { typegraphInit } from '@typegraph-ai/sdk'

type CorpusRow = {
  id: string
  title?: string
  text: string
}

const typegraph = await typegraphInit({
  apiKey: process.env.TYPEGRAPH_API_KEY!,
  tenantId: process.env.TYPEGRAPH_TENANT_ID!,
})

const corpus = JSON.parse(
  await readFile('./data/legal-rag-bench/corpus.json', 'utf8'),
) as CorpusRow[]

const bucket = await typegraph.bucket.create({
  name: 'legal-rag-bench',
  indexDefaults: {
    chunkSize: 2048,
    chunkOverlap: 256,
    graphExtraction: false,
    deduplicateBy: ['content'],
  },
})

await typegraph.document.ingest(
  corpus.map((row) => ({
    id: row.id,
    name: row.title ?? row.id,
    content: row.text,
    metadata: { corpusId: row.id },
  })),
  { bucketId: bucket.id },
)

Run documents-only semantic retrieval

The eval asks for 12 candidates and scores the deduplicated top 10 against the Legal RAG relevance IDs.

eval-legal-rag.ts
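// `typegraph` is the client created with typegraphInit() in seed-legal-rag.ts,
// and `query` is one question/relevance row from the benchmark dataset.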
const response = await typegraph.search(query.question, {
  buckets: [process.env.TYPEGRAPH_BUCKET_ID!],
  resources: ['documents'],
  weights: { semantic: 1, bm25: false, graph: false, recency: false },
  limit: 12,
})

const retrievedCorpusIds = response.results.chunks.map((chunk) =>
  String(chunk.metadata?.corpusId ?? chunk.document.name),
)
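From here, scoring one query is a matter of finding the gold passage's rank in the deduplicated top 10. A sketch using the illustrative scoreSingleGoldQuery helper from the methodology section; query.relevantId is an assumed field name for the gold passage ID:

// Dedupe the retrieved IDs, keep the top 10, and locate the gold passage.
const top10 = [...new Set(retrievedCorpusIds)].slice(0, 10)
const hitIndex = top10.indexOf(query.relevantId)
const rank = hitIndex === -1 ? null : hitIndex + 1 // 1-indexed, null on a miss
const { ndcg, ap, recall } = scoreSingleGoldQuery(rank)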

Metered cost formula

The page reports metered TypeGraph usage from telemetry rather than benchmark runner wall time.

cost.ts
const ingestCost =
  (embedIngestTokens / 1_000_000) * 0.12 +
  (ingestComputeMs / 3_600_000) * 0.52

const queryCost =
  (embedQueryTokens / 1_000_000) * 0.04 +
  (queryComputeMs / 3_600_000) * 0.52
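Plugging the usage from the metered cost table into these formulas reproduces the reported figures:

// Usage values from the table above (compute durations in milliseconds).
const ingestCheck = (1_448_728 / 1_000_000) * 0.12 + (211_163 / 3_600_000) * 0.52
// ≈ 0.1738 + 0.0305 = $0.2043
const queryCheck = (5_899 / 1_000_000) * 0.04 + (23_631 / 3_600_000) * 0.52
// ≈ 0.0002 + 0.0034 = $0.0036
const totalCheck = ingestCheck + queryCheck // ≈ $0.2080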
