Contractual Clause Retrieval Benchmark
TypeGraph found 88 of 90 relevant contract clauses in the top 10 with 0.9289 nDCG@10, after indexing the corpus in 1.73 seconds.
This is a TypeGraph Cloud semantic retrieval run on the public Contractual Clause Retrieval dataset from the Massive Legal Embedding Benchmark. Each query asks TypeGraph Cloud for the top 10 matching results, then the run is scored against the dataset's qrels with BEIR-style retrieval metrics.
The leaderboard table below keeps the official MLEB rows visible and inserts the TypeGraph Cloud run as a highlighted comparison row.
TypeGraph Cloud completed this bucket-scoped semantic retrieval eval in 6.33 seconds according to TypeGraph telemetry. The top official MLEB row shown, Voyage 4 Large, reports 33.16 seconds on the same dataset.
Executive Summary
Contractual Clause Retrieval tests whether a system can map a query about contract language to the right clause-level evidence. This is the kind of lookup legal, procurement, and compliance teams need before a generation layer starts summarizing obligations.
The TypeGraph run used semantic retrieval only over a 90-document corpus. It found 88 of 90 gold clauses in the top 10, with 0.9289 nDCG@10 and a bucket-scoped evaluation runtime of 6.33 seconds.
Each query has two relevant clauses, so raw Precision@10 has a maximum possible value of 0.20. The reported 0.1956 is therefore close to the dataset ceiling and should be read alongside Recall@10 and gold found @10.
Benchmark Dataset
Queries describe contract clause requirements. The retriever must surface the two clause documents marked relevant by the dataset qrels.
| Property | Value |
|---|---|
| Dataset | Contractual Clause Retrieval |
| Category | Contracts |
| Corpus | 90 documents |
| Indexed chunks | 90 chunks |
| Queries | 45 queries |
| Qrels | 2 relevant clauses per query |
| Chunking | 2048 tokens, 256 overlap |
| Ingest time | 1.73s (indexing telemetry) |
| Eval time | 6.33s (bucket-scoped retrieval) |
| Ingest cost | $0.0019 |
| Eval cost | $0.0019 |
| Total cost | $0.0038 |
Ingest and eval timings are bucket-scoped TypeGraph telemetry windows, excluding client-side dataset loading and benchmark runner wait time.
Methodology
- Loaded the Isaacus Contractual Clause Retrieval corpus and qrels from the MLEB BEIR-style dataset.
- Seeded a TypeGraph Cloud benchmark bucket with graph extraction disabled and chunking set to 2048 tokens with 256 overlap.
- Ran documents-only semantic retrieval. BM25, graph, and recency weights were disabled for this run.
- Requested 12 retrieval candidates per query and scored the deduplicated top 10 against the source qrels.
- Scored retrieved corpus IDs against source qrels using nDCG@10, MAP@10, Recall@10, and Precision@10; a minimal scoring sketch follows this list.
- Calculated cost from TypeGraph metered events: ingest embedding tokens, search embedding tokens, and compute duration at the public metered rates.
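The scoring step can be reproduced without a benchmark framework. Below is a minimal sketch of BEIR-style binary-relevance scoring, assuming `qrels` maps each query ID to its set of relevant corpus IDs and `run` maps each query ID to the ranked, deduplicated top-10 corpus IDs. The actual run may score through a library such as pytrec_eval, so treat this as illustrative rather than the exact scoring code.

```python
import math


def score_run(qrels, run, k=10):
    """Score a retrieval run with BEIR-style binary-relevance metrics.

    qrels: dict mapping query_id -> set of relevant corpus IDs
    run:   dict mapping query_id -> ranked list of retrieved corpus IDs
    """
    ndcg, average_precision, recall, precision = [], [], [], []
    for qid, relevant in qrels.items():
        ranked = run.get(qid, [])[:k]
        hits = [1 if doc_id in relevant else 0 for doc_id in ranked]

        # nDCG@k: DCG of the returned ranking over the ideal DCG for this query.
        dcg = sum(h / math.log2(rank + 2) for rank, h in enumerate(hits))
        idcg = sum(1 / math.log2(rank + 2) for rank in range(min(len(relevant), k)))
        ndcg.append(dcg / idcg if idcg else 0.0)

        # AP@k: mean precision at each rank holding a relevant document,
        # normalised by the number of relevant documents (two per query here).
        precisions_at_hits = [
            sum(hits[: rank + 1]) / (rank + 1) for rank, h in enumerate(hits) if h
        ]
        average_precision.append(sum(precisions_at_hits) / len(relevant))

        recall.append(sum(hits) / len(relevant))
        precision.append(sum(hits) / k)

    n = len(qrels)
    return {
        f"nDCG@{k}": sum(ndcg) / n,
        f"MAP@{k}": sum(average_precision) / n,
        f"Recall@{k}": sum(recall) / n,
        f"Precision@{k}": sum(precision) / n,
    }
```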
Detailed Metrics Overview
| Metric | TypeGraph Score | How to read it |
|---|---|---|
| nDCG@10 | 0.928859 | Primary ranking-quality metric used for MLEB leaderboard comparison. |
| MAP@10 | 0.895026 | Mean average precision across all scored queries. |
| Recall@10 | 0.977778 | 88 of 90 relevant contract clauses appeared in the top 10. |
| Precision@10 | 0.195556 | Near the dataset cap of 0.20 because each query has two relevant clauses. |
| Queries run | 45 | All queries were scored. |
| Weights | Semantic only | semantic=1; BM25, graph, and recency disabled. |
How to read Precision@10 here
Precision@10 divides hits by 10 returned slots. Contractual Clause Retrieval has two gold clauses per query, so a perfect retrieval run would score 0.20 rather than 1.00 on P@10.
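As a worked example: 45 queries with two gold clauses each give 90 relevant query-clause pairs across 45 × 10 = 450 returned slots. Finding 88 of the 90 gold clauses therefore yields Precision@10 = 88 / 450 ≈ 0.1956 and Recall@10 = 88 / 90 ≈ 0.9778, matching the metrics table above.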
MLEB Leaderboard Comparison
Official rows come from Isaacus MLEB results.jsonl. The highlighted TypeGraph row is inserted for direct comparison using this page's semantic run.
| Rank | Model | Provider | Dims | nDCG@10 | Eval time |
|---|---|---|---|---|---|
| 1 | TypeGraph Cloud | TypeGraph | 1024 | 0.92886 | 6.33s |
| 2 | Voyage 4 Large | Voyage | 1024 | 0.92765 | 33.16s |
| 3 | Voyage 4 | Voyage | 1024 | 0.91464 | 33.38s |
| 4 | Kanon 2 Embedder | Isaacus | 1792 | 0.90951 | 18.40s |
| 5 | Voyage 4 Lite | Voyage | 1024 | 0.89256 | 15.76s |
| 6 | Qwen3 Embedding 4B | Qwen | 2560 | 0.88279 | 5.76s |
| 7 | Qwen3 Embedding 8B | Qwen | 4096 | 0.86974 | 112.19s |
| 8 | Text Embedding 3 Large | OpenAI | 3072 | 0.86778 | 28.40s |
| 9 | EmbeddingGemma | Google | 768 | 0.82882 | 4.76s |
| 10 | Snowflake Arctic Embed L v2.0 | Snowflake | 1024 | 0.81129 | 1.16s |
| 11 | Qwen3 Embedding 0.6B | Qwen | 1024 | 0.80901 | 4.29s |
Metered Cost
| Meter | Usage | Rate | Cost |
|---|---|---|---|
| Ingest embeddings | 11,236 tokens | $0.12 / M tokens | $0.0013 |
| Ingest compute | 3.968s | $0.52 / CPU-hour | $0.0006 |
| Search embeddings | 877 tokens | $0.04 / M tokens | $0.0000 |
| Query compute | 12.789s | $0.52 / CPU-hour | $0.0018 |
Storage is excluded. Costs use public TypeGraph metered rates: ingest embeddings at $0.12/M tokens, search embeddings at $0.04/M tokens, and compute at $0.52/CPU-hour.
Relevant Code
Create a bucket and ingest the corpus
Download the public MLEB corpus from Hugging Face, save it locally, and ingest it into a TypeGraph bucket with the same chunking settings.
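A minimal sketch of this step. The Hugging Face dataset ID, the record field names, and the TypeGraph client calls (`TypeGraph`, `create_bucket`, `ingest_document`) are illustrative assumptions rather than the documented SDK surface; only the chunking settings are taken from the run configuration above.

```python
# Sketch only: dataset ID, field names, and TypeGraph client/method names are
# assumptions, not the documented SDK. Chunking mirrors the run configuration.
from datasets import load_dataset  # pip install datasets
from typegraph import TypeGraph    # hypothetical TypeGraph Cloud SDK

# Assumed location of the MLEB Contractual Clause Retrieval corpus on Hugging Face.
corpus = load_dataset("isaacus/mleb", "contractual-clause-retrieval", split="corpus")

client = TypeGraph(api_key="...")

# Benchmark bucket: graph extraction disabled, 2048-token chunks with 256 overlap.
bucket = client.create_bucket(
    name="mleb-contractual-clause-retrieval",
    graph_extraction=False,
    chunk_size_tokens=2048,
    chunk_overlap_tokens=256,
)

for doc in corpus:
    bucket.ingest_document(
        external_id=doc["_id"],  # keep the corpus ID so qrels can be scored later
        title=doc.get("title", ""),
        text=doc["text"],
    )
```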
Run semantic retrieval over the queries
Query the bucket with semantic search only and retain the top-10 corpus IDs for BEIR-style scoring.
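A sketch of the retrieval loop, continuing from the ingest sketch above. The `bucket.search` call and its parameters are assumed names; the semantic-only weights, the 12-candidate request, and the deduplicated top 10 mirror the methodology described on this page.

```python
# Sketch only: `bucket.search` and its parameters are assumed names. Continues
# from the ingest sketch (bucket) and the scoring sketch (score_run) above.
from datasets import load_dataset

queries = load_dataset("isaacus/mleb", "contractual-clause-retrieval", split="queries")

run = {}  # query_id -> ranked list of retrieved corpus IDs
for q in queries:
    results = bucket.search(
        query=q["text"],
        limit=12,  # request 12 candidates per query
        weights={"semantic": 1.0, "bm25": 0.0, "graph": 0.0, "recency": 0.0},
    )
    # Deduplicate by corpus ID while preserving rank order, then keep the top 10.
    seen, ranked = set(), []
    for hit in results:
        doc_id = hit["external_id"]
        if doc_id not in seen:
            seen.add(doc_id)
            ranked.append(doc_id)
    run[q["_id"]] = ranked[:10]

metrics = score_run(qrels, run, k=10)  # qrels loaded from the dataset's qrels split
```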
Minimal semantic query shape
For application code, the only TypeGraph-specific part is a semantic query against the benchmark bucket.
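A single query in that shape might look like the snippet below. The call signature is illustrative, and the query text is a made-up example rather than one of the benchmark queries.

```python
# Illustrative semantic-only query against the benchmark bucket (assumed call shape).
hits = bucket.search(
    query="clause limiting liability for indirect or consequential damages",
    limit=10,
    weights={"semantic": 1.0, "bm25": 0.0, "graph": 0.0, "recency": 0.0},
)
for hit in hits:
    print(hit["external_id"], hit["score"])
```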
Metered cost formula
The page reports metered run cost from TypeGraph event usage, not a synthetic estimate from document count.
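Using the published rates and the usage figures from the Metered Cost table, the reported cost can be recomputed directly:

```python
# Recompute the reported run cost from the metered usage figures above.
INGEST_EMBED_RATE = 0.12 / 1_000_000  # $ per ingest embedding token
SEARCH_EMBED_RATE = 0.04 / 1_000_000  # $ per search embedding token
COMPUTE_RATE = 0.52 / 3600            # $ per CPU-second ($0.52 / CPU-hour)

ingest_cost = 11_236 * INGEST_EMBED_RATE + 3.968 * COMPUTE_RATE  # ≈ $0.0019
eval_cost = 877 * SEARCH_EMBED_RATE + 12.789 * COMPUTE_RATE      # ≈ $0.0019
total_cost = ingest_cost + eval_cost                             # ≈ $0.0038

print(f"ingest ${ingest_cost:.4f}  eval ${eval_cost:.4f}  total ${total_cost:.4f}")
```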
References
Related TypeGraph Reading
- Set up semantic document retrieval in TypeGraph.
- Understand weights, scoring, and top-k retrieval.
- Ingest and manage documents in buckets.
- Search documents, events, threads, entities, and facts with semantic, BM25, graph, and recency weights.
- Configure embedding models for ingestion and query.
- How to design retrieval and answer-quality evaluations.
- How to read recall, precision, nDCG, MAP, and MRR.
- Compare against the companion MLEB license retrieval task.
- Review a larger legal retrieval run with TypeGraph content deduplication.