Contractual Clause Retrieval Benchmark
TypeGraph found 88 of 90 relevant contract clauses in the top 10 with 0.9289 nDCG@10, after indexing the corpus in 1.73 seconds.
This is a TypeGraph Cloud semantic retrieval run on the public Contractual Clause Retrieval dataset from the Massive Legal Embedding Benchmark. Each query asks TypeGraph Cloud for the top 10 matching results, then the run is scored against the dataset's qrels with BEIR-style retrieval metrics.
The leaderboard table below keeps the official MLEB rows visible and inserts the TypeGraph Cloud run as a highlighted comparison row.
TypeGraph Cloud completed this bucket-scoped semantic retrieval eval in 6.33 seconds according to TypeGraph telemetry. The top official MLEB row shown, Voyage 4 Large, reports 33.16 seconds on the same dataset.
Executive Summary
Contractual Clause Retrieval tests whether a system can map a query about contract language to the right clause-level evidence. This is the kind of lookup legal, procurement, and compliance teams need before a generation layer starts summarizing obligations.
The TypeGraph run used semantic retrieval only over a 90-document corpus. It found 88 of 90 gold clauses in the top 10, with 0.9289 nDCG@10 and a bucket-scoped evaluation runtime of 6.33 seconds.
Each query has two relevant clauses, so raw Precision@10 has a maximum possible value of 0.20. The reported 0.1956 is therefore close to the dataset ceiling and should be read alongside Recall@10 and gold found @10.
Benchmark Dataset
Queries describe contract clause requirements. The retriever must surface the two clause documents marked relevant by the dataset qrels.
| Property | Value |
|---|---|
| Dataset | Contractual Clause Retrieval |
| Category | Contracts |
| Corpus | 90 documents |
| Indexed chunks | 90 chunks |
| Queries | 45 queries |
| Qrels | 2 relevant clauses per query |
| Chunking | 2048 tokens, 256 overlap |
| Ingest time | 1.73s (indexing telemetry) |
| Eval time | 6.33s (bucket-scoped retrieval) |
| Ingest cost | $0.0019 |
| Eval cost | $0.0019 |
| Total cost | $0.0038 |
Ingest and eval timings are bucket-scoped TypeGraph telemetry windows, excluding client-side dataset loading and benchmark runner wait time.
Methodology
- Loaded the Isaacus Contractual Clause Retrieval corpus and qrels from the MLEB BEIR-style dataset.
- Seeded a TypeGraph Cloud benchmark bucket with graph extraction disabled and chunking set to 2048 tokens with 256 overlap.
- Ran documents-only semantic retrieval. BM25, graph, and recency weights were disabled for this run.
- Requested 12 retrieval candidates per query and scored the deduplicated top 10 against the source qrels.
- Scored retrieved corpus IDs against source qrels using nDCG@10, MAP@10, Recall@10, and Precision@10; a minimal scoring sketch follows this list.
- Calculated cost from TypeGraph metered events: ingest embedding tokens, search embedding tokens, and compute duration at the public metered rates.
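The scoring step can be reproduced without a benchmark framework. Below is a minimal sketch of BEIR-style binary-relevance scoring, assuming `qrels` maps each query ID to its set of relevant corpus IDs and `run` maps each query ID to the ranked, deduplicated top-10 corpus IDs. The actual run may score through a library such as pytrec_eval, so treat this as illustrative rather than the exact scoring code.

```python
import math


def score_run(qrels, run, k=10):
    """Score a retrieval run with BEIR-style binary-relevance metrics.

    qrels: dict mapping query_id -> set of relevant corpus IDs
    run:   dict mapping query_id -> ranked list of retrieved corpus IDs
    """
    ndcg, average_precision, recall, precision = [], [], [], []
    for qid, relevant in qrels.items():
        ranked = run.get(qid, [])[:k]
        hits = [1 if doc_id in relevant else 0 for doc_id in ranked]

        # nDCG@k: DCG of the returned ranking over the ideal DCG for this query.
        dcg = sum(h / math.log2(rank + 2) for rank, h in enumerate(hits))
        idcg = sum(1 / math.log2(rank + 2) for rank in range(min(len(relevant), k)))
        ndcg.append(dcg / idcg if idcg else 0.0)

        # AP@k: mean precision at each rank holding a relevant document,
        # normalised by the number of relevant documents (two per query here).
        precisions_at_hits = [
            sum(hits[: rank + 1]) / (rank + 1) for rank, h in enumerate(hits) if h
        ]
        average_precision.append(sum(precisions_at_hits) / len(relevant))

        recall.append(sum(hits) / len(relevant))
        precision.append(sum(hits) / k)

    n = len(qrels)
    return {
        f"nDCG@{k}": sum(ndcg) / n,
        f"MAP@{k}": sum(average_precision) / n,
        f"Recall@{k}": sum(recall) / n,
        f"Precision@{k}": sum(precision) / n,
    }
```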
Detailed Metrics Overview
| Metric | TypeGraph Score | How to read it |
|---|---|---|
| nDCG@10 | 0.928859 | Primary ranking-quality metric used for MLEB leaderboard comparison. |
| MAP@10 | 0.895026 | Mean average precision across all scored queries. |
| Recall@10 | 0.977778 | 88 of 90 relevant contract clauses appeared in the top 10. |
| Precision@10 | 0.195556 | Near the dataset cap of 0.20 because each query has two relevant clauses. |
| Queries run | 45 | All queries were scored. |
| Weights | Semantic only | semantic=1; BM25, graph, and recency disabled. |
How to read Precision@10 here
Precision@10 divides hits by 10 returned slots. Contractual Clause Retrieval has two gold clauses per query, so a perfect retrieval run would score 0.20 rather than 1.00 on P@10.
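As a worked example: 45 queries with two gold clauses each give 90 relevant query-clause pairs across 45 × 10 = 450 returned slots. Finding 88 of the 90 gold clauses therefore yields Precision@10 = 88 / 450 ≈ 0.1956 and Recall@10 = 88 / 90 ≈ 0.9778, matching the metrics table above.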
MLEB Leaderboard Comparison
Official rows come from Isaacus MLEB results.jsonl. The highlighted TypeGraph row is inserted for direct comparison using this page's semantic run.
| Rank | Model | Provider | Dims | nDCG@10 | Eval time |
|---|---|---|---|---|---|
| 1 | TypeGraph Cloud | TypeGraph | 1024 | 0.92886 | 6.33s |
| 2 | Voyage 4 Large | Voyage | 1024 | 0.92765 | 33.16s |
| 3 | Voyage 4 | Voyage | 1024 | 0.91464 | 33.38s |
| 4 | Kanon 2 Embedder | Isaacus | 1792 | 0.90951 | 18.40s |
| 5 | Voyage 4 Lite | Voyage | 1024 | 0.89256 | 15.76s |
| 6 | Qwen3 Embedding 4B | Qwen | 2560 | 0.88279 | 5.76s |
| 7 | Qwen3 Embedding 8B | Qwen | 4096 | 0.86974 | 112.19s |
| 8 | Text Embedding 3 Large | OpenAI | 3072 | 0.86778 | 28.40s |
| 9 | EmbeddingGemma | Google | 768 | 0.82882 | 4.76s |
| 10 | Snowflake Arctic Embed L v2.0 | Snowflake | 1024 | 0.81129 | 1.16s |
| 11 | Qwen3 Embedding 0.6B | Qwen | 1024 | 0.80901 | 4.29s |
Metered Cost
| Meter | Usage | Rate | Cost |
|---|---|---|---|
| Ingest embeddings | 11,236 tokens | $0.12 / M tokens | $0.0013 |
| Ingest compute | 3.968s | $0.52 / CPU-hour | $0.0006 |
| Search embeddings | 877 tokens | $0.04 / M tokens | $0.0000 |
| Query compute | 12.789s | $0.52 / CPU-hour | $0.0018 |
Storage is excluded. Costs use public TypeGraph metered rates: ingest embeddings at $0.12/M tokens, search embeddings at $0.04/M tokens, and compute at $0.52/CPU-hour.
Relevant Code
Create a bucket and ingest the corpus
Download the public MLEB corpus from Hugging Face, save it locally, and ingest it into a TypeGraph bucket with the same chunking settings.
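A minimal sketch of this step. The Hugging Face dataset ID, the record field names, and the TypeGraph client calls (`TypeGraph`, `create_bucket`, `ingest_document`) are illustrative assumptions rather than the documented SDK surface; only the chunking settings are taken from the run configuration above.

```python
# Sketch only: dataset ID, field names, and TypeGraph client/method names are
# assumptions, not the documented SDK. Chunking mirrors the run configuration.
from datasets import load_dataset  # pip install datasets
from typegraph import TypeGraph    # hypothetical TypeGraph Cloud SDK

# Assumed location of the MLEB Contractual Clause Retrieval corpus on Hugging Face.
corpus = load_dataset("isaacus/mleb", "contractual-clause-retrieval", split="corpus")

client = TypeGraph(api_key="...")

# Benchmark bucket: graph extraction disabled, 2048-token chunks with 256 overlap.
bucket = client.create_bucket(
    name="mleb-contractual-clause-retrieval",
    graph_extraction=False,
    chunk_size_tokens=2048,
    chunk_overlap_tokens=256,
)

for doc in corpus:
    bucket.ingest_document(
        external_id=doc["_id"],  # keep the corpus ID so qrels can be scored later
        title=doc.get("title", ""),
        text=doc["text"],
    )
```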
Run semantic retrieval over the queries
Query the bucket with semantic search only and retain the top-10 corpus IDs for BEIR-style scoring.
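A sketch of the retrieval loop, continuing from the ingest sketch above. The `bucket.search` call and its parameters are assumed names; the semantic-only weights, the 12-candidate request, and the deduplicated top 10 mirror the methodology described on this page.

```python
# Sketch only: `bucket.search` and its parameters are assumed names. Continues
# from the ingest sketch (bucket) and the scoring sketch (score_run) above.
from datasets import load_dataset

queries = load_dataset("isaacus/mleb", "contractual-clause-retrieval", split="queries")

run = {}  # query_id -> ranked list of retrieved corpus IDs
for q in queries:
    results = bucket.search(
        query=q["text"],
        limit=12,  # request 12 candidates per query
        weights={"semantic": 1.0, "bm25": 0.0, "graph": 0.0, "recency": 0.0},
    )
    # Deduplicate by corpus ID while preserving rank order, then keep the top 10.
    seen, ranked = set(), []
    for hit in results:
        doc_id = hit["external_id"]
        if doc_id not in seen:
            seen.add(doc_id)
            ranked.append(doc_id)
    run[q["_id"]] = ranked[:10]

metrics = score_run(qrels, run, k=10)  # qrels loaded from the dataset's qrels split
```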
Minimal semantic query shape
For application code, the only TypeGraph-specific part is a semantic query against the benchmark bucket.
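A single query in that shape might look like the snippet below. The call signature is illustrative, and the query text is a made-up example rather than one of the benchmark queries.

```python
# Illustrative semantic-only query against the benchmark bucket (assumed call shape).
hits = bucket.search(
    query="clause limiting liability for indirect or consequential damages",
    limit=10,
    weights={"semantic": 1.0, "bm25": 0.0, "graph": 0.0, "recency": 0.0},
)
for hit in hits:
    print(hit["external_id"], hit["score"])
```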
Metered cost formula
The page reports metered run cost from TypeGraph event usage, not a synthetic estimate from document count.
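Using the published rates and the usage figures from the Metered Cost table, the reported cost can be recomputed directly:

```python
# Recompute the reported run cost from the metered usage figures above.
INGEST_EMBED_RATE = 0.12 / 1_000_000  # $ per ingest embedding token
SEARCH_EMBED_RATE = 0.04 / 1_000_000  # $ per search embedding token
COMPUTE_RATE = 0.52 / 3600            # $ per CPU-second ($0.52 / CPU-hour)

ingest_cost = 11_236 * INGEST_EMBED_RATE + 3.968 * COMPUTE_RATE  # ≈ $0.0019
eval_cost = 877 * SEARCH_EMBED_RATE + 12.789 * COMPUTE_RATE      # ≈ $0.0019
total_cost = ingest_cost + eval_cost                             # ≈ $0.0038

print(f"ingest ${ingest_cost:.4f}  eval ${eval_cost:.4f}  total ${total_cost:.4f}")
```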
References
Related TypeGraph Reading
- Set up semantic document retrieval in TypeGraph.
- Understand weights, scoring, and top-k retrieval.
- Ingest and manage documents in buckets.
- Search documents, events, threads, entities, and facts with semantic, BM25, graph, and recency weights.
- Configure embedding models for ingestion and query.
- How to design retrieval and answer-quality evaluations.
- How to read recall, precision, nDCG, MAP, and MRR.
- Compare against the companion MLEB license retrieval task.
- Review a larger legal retrieval run with TypeGraph content deduplication.