
Contractual Clause Retrieval Benchmark

Published April 28, 2026 · Updated May 13, 2026

TypeGraph found 88 of 90 relevant contract clauses in the top 10 with 0.9289 nDCG@10, after indexing the corpus in 1.73 seconds.

What this page shows

This is a TypeGraph Cloud semantic retrieval run on the public Contractual Clause Retrieval dataset from the Massive Legal Embedding Benchmark (MLEB). For each query, TypeGraph Cloud returns its top 10 matching results; the run is then scored against the dataset's qrels with BEIR-style retrieval metrics.

The leaderboard table below keeps the official MLEB rows visible and inserts the TypeGraph Cloud run as a highlighted comparison row.

Semantic eval duration: 6.33s (5.2x faster than the top official Contractual Clause leaderboard row shown)

TypeGraph Cloud completed this bucket-scoped semantic retrieval eval in 6.33 seconds according to TypeGraph telemetry. The top official MLEB row shown, Voyage 4 Large, reports 33.16 seconds on the same dataset.

nDCG@10: 0.9289 (ranking quality across all 45 scored queries)
Recall@10: 97.78% (88 of 90 gold clauses retrieved)
Eval time: 6.33s (bucket-scoped semantic retrieval telemetry)
Ingest time: 1.73s (bucket-scoped indexing telemetry)
Corpus: 90 docs (90 indexed chunks after 2048-token chunking)
Cost: $0.0038 (metered ingest plus query execution)

Executive Summary

Contractual Clause Retrieval tests whether a system can map a query about contract language to the right clause-level evidence. This is the kind of lookup legal, procurement, and compliance teams need before a generation layer starts summarizing obligations.

The TypeGraph run used semantic retrieval only over a 90-document corpus. It found 88 of 90 gold clauses in the top 10, with 0.9289 nDCG@10 and a bucket-scoped evaluation runtime of 6.33 seconds.

Each query has two relevant clauses, so raw Precision@10 has a maximum possible value of 0.20. The reported 0.1956 is therefore close to the dataset ceiling and should be read alongside Recall@10 and the count of gold clauses found in the top 10.

Benchmark Dataset

Queries describe contract clause requirements. The retriever must surface the two clause documents marked relevant by the dataset qrels.

Property | Value
Dataset | Contractual Clause Retrieval
Category | Contracts
Corpus | 90 documents
Indexed chunks | 90 chunks
Queries | 45 queries
Qrels | 2 relevant clauses per query
Chunking | 2048 tokens, 256 overlap
Ingest time | 1.73s (indexing telemetry)
Eval time | 6.33s (bucket retrieval)
Ingest cost | $0.0019
Eval cost | $0.0019
Total cost | $0.0038

Ingest and eval timings are bucket-scoped TypeGraph telemetry windows, excluding client-side dataset loading and benchmark runner wait time.

Methodology

  1. Loaded the Isaacus Contractual Clause Retrieval corpus and qrels from the MLEB BEIR-style dataset.
  2. Seeded a TypeGraph Cloud benchmark bucket with graph extraction disabled and chunking set to 2048 tokens with 256 overlap.
  3. Ran documents-only semantic retrieval. BM25, graph, and recency weights were disabled for this run.
  4. Requested 12 retrieval candidates per query and kept the deduplicated top 10 for scoring.
  5. Scored retrieved corpus IDs against source qrels using nDCG@10, MAP@10, Recall@10, and Precision@10.
  6. Calculated cost from TypeGraph metered events: ingest embedding tokens, search embedding tokens, and compute duration at the public metered rates.
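The nDCG@10 scoring in step 5 can be sketched as follows. This is an illustrative implementation over binary relevance judgments, not the benchmark's actual scorer; the function names are our own.

```typescript
// DCG@k: sum of graded gains discounted by log2 of the (1-based) rank + 1.
function dcgAtK(gains: number[], k: number): number {
  return gains
    .slice(0, k)
    .reduce((sum, gain, i) => sum + gain / Math.log2(i + 2), 0)
}

// nDCG@k for one query: DCG of the retrieved ranking over the ideal DCG.
function ndcgAtK(
  retrieved: string[],
  rels: Record<string, number>, // qrels row for one query: corpusId -> relevance
  k = 10,
): number {
  const gains = retrieved.map((id) => rels[id] ?? 0)
  const idealGains = Object.values(rels).sort((a, b) => b - a)
  const idcg = dcgAtK(idealGains, k)
  return idcg === 0 ? 0 : dcgAtK(gains, k) / idcg
}

// Both gold clauses retrieved, at ranks 1 and 3:
console.log(ndcgAtK(['clause-a', 'other', 'clause-b'], { 'clause-a': 1, 'clause-b': 1 })) // ≈ 0.9197
```

Averaging this per-query value over all 45 queries gives the reported 0.9289.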

Detailed Metrics Overview

Before we dive into the leaderboard, here's a quick overview of the metrics, TypeGraph Cloud's scores, and how to read them:
Metric | TypeGraph Score | How to read it
nDCG@10 | 0.928859 | Primary ranking-quality metric used for MLEB leaderboard comparison.
MAP@10 | 0.895026 | Mean average precision across all scored queries.
Recall@10 | 0.977778 | 88 of 90 relevant contract clauses appeared in the top 10.
Precision@10 | 0.195556 | Near the dataset cap of 0.20 because each query has two relevant clauses.
Queries run | 45 | All queries were scored.
Weights | Semantic only | semantic=1; BM25, graph, and recency disabled.

How to read Precision@10 here

Precision@10 divides hits by 10 returned slots. Contractual Clause Retrieval has two gold clauses per query, so a perfect retrieval run would score 0.20 rather than 1.00 on P@10.
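The ceiling and the observed score both fall out of simple arithmetic over the dataset shape:

```typescript
// Two gold clauses per query means even a perfect top-10 caps P@10 at 0.20.
const goldPerQuery = 2
const k = 10
const maxPrecisionAt10 = goldPerQuery / k // 0.20

// The reported 0.195556 is the 88 retrieved gold clauses divided by
// 45 queries x 10 returned slots:
const observedPrecisionAt10 = 88 / (45 * 10) // ≈ 0.1956
```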

MLEB Leaderboard Comparison

Official rows come from Isaacus MLEB results.jsonl. The highlighted TypeGraph row is inserted for direct comparison using this page's semantic run.

Rank | Model | Provider | Dims | nDCG@10 | Eval time
1 | TypeGraph Cloud | TypeGraph | 1024 | 0.92886 | 6.33s
2 | Voyage 4 Large | Voyage | 1024 | 0.92765 | 33.16s
3 | Voyage 4 | Voyage | 1024 | 0.91464 | 33.38s
4 | Kanon 2 Embedder | Isaacus | 1792 | 0.90951 | 18.40s
5 | Voyage 4 Lite | Voyage | 1024 | 0.89256 | 15.76s
6 | Qwen3 Embedding 4B | Qwen | 2560 | 0.88279 | 5.76s
7 | Qwen3 Embedding 8B | Qwen | 4096 | 0.86974 | 112.19s
8 | Text Embedding 3 Large | OpenAI | 3072 | 0.86778 | 28.40s
9 | EmbeddingGemma | Google | 768 | 0.82882 | 4.76s
10 | Snowflake Arctic Embed L v2.0 | Snowflake | 1024 | 0.81129 | 1.16s
11 | Qwen3 Embedding 0.6B | Qwen | 1024 | 0.80901 | 4.29s
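Positioning the highlighted comparison row amounts to merging the TypeGraph run into the official rows and re-ranking by nDCG@10 descending. A minimal sketch using a subset of the rows above (the `LeaderboardRow` type is our own):

```typescript
type LeaderboardRow = { model: string; ndcg10: number; evalSeconds: number }

// Subset of the official MLEB rows shown in the table above.
const officialRows: LeaderboardRow[] = [
  { model: 'Voyage 4 Large', ndcg10: 0.92765, evalSeconds: 33.16 },
  { model: 'Voyage 4', ndcg10: 0.91464, evalSeconds: 33.38 },
  { model: 'Kanon 2 Embedder', ndcg10: 0.90951, evalSeconds: 18.4 },
]

const typegraphRun: LeaderboardRow = {
  model: 'TypeGraph Cloud',
  ndcg10: 0.92886,
  evalSeconds: 6.33,
}

// Merge and re-rank by nDCG@10, highest first.
const ranked = [...officialRows, typegraphRun].sort((a, b) => b.ndcg10 - a.ndcg10)
console.log(ranked[0].model) // TypeGraph Cloud
```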

Metered Cost

Ingest: $0.0019
Query: $0.0019
Total: $0.0038

Meter | Usage | Rate | Cost
Ingest embeddings | 11,236 tokens | $0.12 / M tokens | $0.0013
Ingest compute | 3.968s | $0.52 / CPU-hour | $0.0006
Search embeddings | 877 tokens | $0.04 / M tokens | $0.0000
Query compute | 12.789s | $0.52 / CPU-hour | $0.0018

Storage is excluded. Costs use public TypeGraph metered rates: ingest embeddings at $0.12/M tokens, search embeddings at $0.04/M tokens, and compute at $0.52/CPU-hour.

Relevant Code

Create a bucket and ingest the corpus

Download the public MLEB corpus from Hugging Face, save it locally, and ingest it into a TypeGraph bucket with the same chunking settings.

seed-mleb.ts
import { readFile } from 'node:fs/promises'
import { typegraphInit } from '@typegraph-ai/sdk'

type CorpusRow = {
  _id: string
  title?: string
  text: string
  metadata?: Record<string, unknown>
}

const typegraph = await typegraphInit({
  apiKey: process.env.TYPEGRAPH_API_KEY!,
  tenantId: process.env.TYPEGRAPH_TENANT_ID!,
})

const corpus = JSON.parse(
  await readFile('./data/contractual-clause-retrieval/corpus.json', 'utf8'),
) as CorpusRow[]

const bucket = await typegraph.bucket.create({
  name: 'contractual-clause-retrieval',
  indexDefaults: {
    chunkSize: 2048,
    chunkOverlap: 256,
    graphExtraction: false,
    deduplicateBy: ['content'],
  },
})

const documents = corpus.map((row) => ({
  id: row._id,
  name: row.title ?? row._id,
  content: row.text,
  metadata: {
    ...(row.metadata ?? {}),
    corpusId: row._id,
    documentName: row.title,
  },
}))

await typegraph.document.ingest(documents, { bucketId: bucket.id })

console.log(`Bucket ready: ${bucket.id}`)

Run semantic retrieval over the queries

Query the bucket with semantic search only and retain the top-10 corpus IDs for BEIR-style scoring.

eval-semantic.ts
import { readFile } from 'node:fs/promises'
import { typegraphInit } from '@typegraph-ai/sdk'

type QueryRow = { _id: string; text: string }

const typegraph = await typegraphInit({
  apiKey: process.env.TYPEGRAPH_API_KEY!,
  tenantId: process.env.TYPEGRAPH_TENANT_ID!,
})

const bucketId = process.env.TYPEGRAPH_BUCKET_ID!
const queries = JSON.parse(
  await readFile('./data/contractual-clause-retrieval/queries.json', 'utf8'),
) as QueryRow[]

const run = new Map<string, string[]>()

for (const query of queries) {
  const response = await typegraph.search(query.text, {
    buckets: [bucketId],
    resources: ['documents'],
    weights: { semantic: 1, bm25: false, graph: false, recency: false },
    limit: 12,
  })

  run.set(
    query._id,
    response.results.chunks.map((chunk) =>
      String(chunk.metadata?.corpusId ?? chunk.document.name),
    ),
  )
}

// Score the deduplicated top 10 in `run` against qrels.json with nDCG@10, MAP@10, Recall@10, and P@10.
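The collapse from 12 candidates to the deduplicated top 10 (step 4 of the methodology) can be sketched with a hypothetical helper:

```typescript
// Drop duplicate corpus IDs (keeping first occurrence, i.e. best rank),
// then trim to the k slots the scorer consumes.
function dedupTopK(corpusIds: string[], k = 10): string[] {
  return [...new Set(corpusIds)].slice(0, k)
}

console.log(dedupTopK(['c1', 'c2', 'c1', 'c3'], 2)) // ['c1', 'c2']
```

Requesting 12 candidates leaves headroom so that duplicates removed here do not leave the scored list short of 10.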

Minimal semantic query shape

For application code, the only TypeGraph-specific part is a semantic query against the benchmark bucket.

semantic-query.ts
const response = await typegraph.search(query.text, {
  buckets: [process.env.TYPEGRAPH_BUCKET_ID!],
  resources: ['documents'],
  weights: { semantic: 1, bm25: false, graph: false, recency: false },
  limit: 12,
})

const retrievedCorpusIds = response.results.chunks.map((chunk) =>
  String(chunk.metadata?.corpusId ?? chunk.document.name),
)

Metered cost formula

The page reports metered run cost from TypeGraph event usage, not a synthetic estimate from document count.

cost.ts
const ingestCost =
  (embedIngestTokens / 1_000_000) * 0.12 +
  (ingestComputeMs / 3_600_000) * 0.52

const queryCost =
  (embedQueryTokens / 1_000_000) * 0.04 +
  (queryComputeMs / 3_600_000) * 0.52
