License TL;DR Retrieval Benchmark
TypeGraph found 61 of 65 relevant software license documents in the top 10 with 0.8066 nDCG@10, running the full semantic eval in 10.38 seconds.
This is a TypeGraph Cloud semantic retrieval run on the public License TL;DR Retrieval dataset from the Massive Legal Embedding Benchmark (MLEB). Each query asks TypeGraph Cloud for the top 10 matching results, and the run is then scored against the dataset's qrels with BEIR-style retrieval metrics.
The leaderboard table below keeps the official MLEB rows visible and inserts the TypeGraph Cloud run as a highlighted comparison row.
TypeGraph Cloud completed this bucket-scoped semantic retrieval eval in 10.38 seconds according to TypeGraph telemetry. The top official MLEB row shown, Voyage 4 Large, reports 36.84 seconds on the same dataset; treat this as leaderboard-reported runtime context rather than a controlled hardware benchmark.
Executive Summary
License TL;DR retrieval tests whether a RAG system can connect short natural-language license obligations to the correct software license document. The task is compact, but it is easy to misread because each query has exactly one relevant document.
The TypeGraph run used semantic retrieval only, 2048-token chunks, a top-10 scoring cutoff, and BEIR-style ranking metrics. The important product signal is that 61 of 65 gold documents appeared in the first 10 scored results, with 0.8066 nDCG@10 over the full query set.
Precision@10 is included for completeness, but it should not be used as the headline metric for this dataset. Because there is only one gold document per query, the maximum possible P@10 is 0.10 even for a perfect run.
Benchmark Dataset
Queries are plain-language license summaries and obligations. The retriever must surface the one license document that matches each summary.
| Property | Value |
|---|---|
| Dataset | License TL;DR Retrieval |
| Category | Contracts |
| Corpus | 65 documents |
| Indexed chunks | 115 chunks |
| Queries | 65 queries |
| Qrels | 1 relevant document per query |
| Chunking | 2048 tokens, 256 overlap |
| Ingest time | 3.76s indexing telemetry |
| Eval time | 10.38s bucket retrieval |
| Ingest cost | $0.0168 |
| Eval cost | $0.0026 |
| Total cost | $0.0194 |
Ingest and eval timings are bucket-scoped TypeGraph telemetry windows, excluding client-side dataset loading and benchmark runner wait time.
Methodology
- Loaded the Isaacus License TL;DR corpus and qrels from the MLEB BEIR-style dataset.
- Seeded a TypeGraph Cloud benchmark bucket with graph extraction disabled and chunking set to 2048 tokens with 256 overlap.
- Ran documents-only semantic retrieval. BM25, graph, and recency weights were disabled for this run.
- Requested 12 retrieval candidates per query and scored the deduplicated top 10 against the source qrels.
- Scored the retrieved corpus IDs with nDCG@10, MAP@10, Recall@10, and Precision@10 (a scoring sketch follows this list).
- Calculated cost from TypeGraph metered events: ingest embedding tokens, search embedding tokens, and compute duration at the public metered rates.
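The scoring step in the bullets above is standard BEIR-style evaluation and can be reproduced with the open-source pytrec_eval library. The sketch below assumes `qrels` maps each query ID to `{corpus_id: relevance}` (loaded from the dataset) and `run` maps each query ID to the `{corpus_id: score}` pairs kept from retrieval; only the metric names are taken from this page.

```python
import pytrec_eval  # standard TREC-style evaluator used by BEIR tooling

# qrels: {query_id: {corpus_id: relevance}}   from the MLEB dataset
# run:   {query_id: {corpus_id: score}}       from the top-10 retrieval results
evaluator = pytrec_eval.RelevanceEvaluator(
    qrels, {"ndcg_cut.10", "map_cut.10", "recall.10", "P.10"}
)
per_query = evaluator.evaluate(run)

def mean(metric: str) -> float:
    # Average a per-query metric over all scored queries.
    return sum(scores[metric] for scores in per_query.values()) / len(per_query)

print("nDCG@10:  ", mean("ndcg_cut_10"))
print("MAP@10:   ", mean("map_cut_10"))
print("Recall@10:", mean("recall_10"))
print("P@10:     ", mean("P_10"))
```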
Detailed Metrics Overview
| Metric | TypeGraph Score | How to read it |
|---|---|---|
| nDCG@10 | 0.806641 | Primary ranking-quality metric used for MLEB leaderboard comparison. |
| MAP@10 | 0.764652 | Mean average precision across all scored queries. |
| Recall@10 | 0.938462 | 61 of 65 relevant license documents appeared in the top 10. |
| Precision@10 | 0.093846 | Near the dataset cap of 0.10 because each query has one relevant document. |
| Queries run | 65 | All queries were scored. |
| Weights | Semantic only | semantic=1; BM25, graph, and recency disabled. |
How to read Precision@10 here
Precision@10 divides hits by 10 returned slots. License TL;DR has one gold document per query, so a perfect retrieval run would score 0.10 rather than 1.00 on P@10.
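A quick arithmetic check makes the cap concrete. Averaging per-query precision over 65 queries with 10 scored slots each, the 61 hits reported on this page land exactly on the published figure:

```python
hits, num_queries, k = 61, 65, 10   # figures reported on this page

p_at_10 = hits / (num_queries * k)               # mean Precision@10 across all queries
best_possible = num_queries / (num_queries * k)  # one relevant document per query

print(p_at_10)        # 0.0938..., matching the reported 0.093846
print(best_possible)  # 0.10, the ceiling for this dataset
```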
MLEB Leaderboard Comparison
Official rows come from the Isaacus MLEB `results.jsonl`. The highlighted TypeGraph row is inserted for direct comparison using this page's semantic run.
| Rank | Model | Provider | Dims | nDCG@10 | Eval time |
|---|---|---|---|---|---|
| 1 | Voyage 4 Large | Voyage | 1024 | 0.81420 | 36.84s |
| 2 | TypeGraph Cloud | TypeGraph | 1024 | 0.80664 | 10.38s |
| 3 | Voyage 4 | Voyage | 1024 | 0.77602 | 40.05s |
| 4 | Qwen3 Embedding 4B | Qwen | 2560 | 0.76430 | 12.92s |
| 5 | Kanon 2 Embedder | Isaacus | 1792 | 0.74610 | 28.48s |
| 6 | Qwen3 Embedding 8B | Qwen | 4096 | 0.73280 | 401.92s |
| 7 | Voyage 4 Lite | Voyage | 1024 | 0.71817 | 21.23s |
| 8 | Jina Embeddings v5 Text Small | Jina | 1024 | 0.70985 | 4.58s |
| 9 | Gemini Embedding 001 | Google | 3072 | 0.69081 | 42.21s |
| 10 | Jina Embeddings v5 Text Nano | Jina | 768 | 0.67571 | 1.13s |
| 11 | Text Embedding 3 Large | OpenAI | 3072 | 0.66684 | 41.50s |
Metered Cost
| Meter | Usage | Rate | Cost |
|---|---|---|---|
| Ingest embeddings | 135,446 tokens | $0.12 / M tokens | $0.0163 |
| Ingest compute | 4.063s | $0.52 / CPU-hour | $0.0006 |
| Search embeddings | 3,660 tokens | $0.04 / M tokens | $0.0001 |
| Query compute | 16.720s | $0.52 / CPU-hour | $0.0024 |
Storage is excluded. Costs use public TypeGraph metered rates: ingest embeddings at $0.12/M tokens, search embeddings at $0.04/M tokens, and compute at $0.52/CPU-hour.
Relevant Code
Create a bucket and ingest the corpus
Download the public MLEB corpus from Hugging Face, save it locally, and ingest it into a TypeGraph bucket with the same chunking settings.
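This page does not pin down the TypeGraph Cloud SDK surface, so the sketch below is illustrative only: the `typegraph` package, `TypeGraphClient`, `create_bucket`, and `ingest_document` names are assumptions, as is the Hugging Face dataset ID; the chunking and graph settings mirror the run described above.

```python
# Sketch under stated assumptions: the TypeGraph client names and the
# Hugging Face dataset ID below are illustrative, not a documented API.
from datasets import load_dataset        # real Hugging Face `datasets` API
from typegraph import TypeGraphClient    # hypothetical TypeGraph Cloud SDK

DATASET_ID = "isaacus/mleb-license-tldr"  # assumed dataset ID on Hugging Face

# BEIR-style corpus split: one row per license document (_id, title, text).
corpus = load_dataset(DATASET_ID, "corpus", split="corpus")

client = TypeGraphClient(api_key="...")   # hypothetical constructor

# Benchmark bucket with the same settings as this run:
# graph extraction disabled, 2048-token chunks with 256-token overlap.
bucket = client.create_bucket(
    name="mleb-license-tldr",
    graph_extraction=False,
    chunk_tokens=2048,
    chunk_overlap=256,
)

for doc in corpus:
    bucket.ingest_document(
        external_id=doc["_id"],           # keep the BEIR corpus ID for qrels scoring
        title=doc.get("title", ""),
        text=doc["text"],
    )
```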
Run semantic retrieval over the queries
Query the bucket with semantic search only and retain the top-10 corpus IDs for BEIR-style scoring.
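Continuing from the ingest sketch above (same hypothetical `bucket` object and `DATASET_ID`), a minimal version of the retrieval loop might look like this; the `search` parameters and hit fields are assumptions that mirror the weights and candidate counts stated in the methodology.

```python
from datasets import load_dataset  # real Hugging Face `datasets` API

# Assumed BEIR-style queries split: one row per plain-language license summary.
queries = load_dataset(DATASET_ID, "queries", split="queries")

run = {}  # query_id -> {corpus_id: score}, the shape BEIR-style scorers expect
for q in queries:
    hits = bucket.search(                  # hypothetical semantic search call
        query=q["text"],
        limit=12,                          # 12 candidates requested per query
        weights={"semantic": 1.0, "bm25": 0.0, "graph": 0.0, "recency": 0.0},
    )
    top_docs = {}
    for hit in hits:
        doc_id = hit.document_external_id  # assumed field carrying the corpus ID
        if doc_id not in top_docs:
            top_docs[doc_id] = hit.score   # deduplicate chunks to document level
        if len(top_docs) == 10:
            break                          # score only the deduplicated top 10
    run[q["_id"]] = top_docs
```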
Minimal semantic query shape
For application code, the only TypeGraph-specific part is a semantic query against the benchmark bucket.
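A minimal sketch of that query, reusing the hypothetical client from the snippets above (method name, weight keys, and hit fields are all assumptions):

```python
# Hypothetical semantic-only query against the benchmark bucket.
hits = bucket.search(
    query="Can I sublicense this software if I keep the copyright notice?",  # example obligation-style query
    limit=10,
    weights={"semantic": 1.0},   # BM25, graph, and recency left disabled
)
for hit in hits:
    print(hit.score, hit.document_external_id)
```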
Metered cost formula
The page reports metered run cost from TypeGraph event usage, not a synthetic estimate from document count.
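As a sketch, the reported numbers reproduce directly from the usage figures and public rates in the Metered Cost table above; the variable names below are plain stand-ins for fields in TypeGraph's metered events, not a documented schema.

```python
# Usage from this run's metered events (see the Metered Cost table).
ingest_embed_tokens = 135_446
search_embed_tokens = 3_660
ingest_compute_seconds = 4.063
query_compute_seconds = 16.720

# Public TypeGraph metered rates quoted on this page.
INGEST_EMBED_PER_M_TOKENS = 0.12   # $ per million tokens
SEARCH_EMBED_PER_M_TOKENS = 0.04   # $ per million tokens
COMPUTE_PER_CPU_HOUR = 0.52        # $ per CPU-hour

ingest_cost = (ingest_embed_tokens / 1e6) * INGEST_EMBED_PER_M_TOKENS \
    + (ingest_compute_seconds / 3600) * COMPUTE_PER_CPU_HOUR
eval_cost = (search_embed_tokens / 1e6) * SEARCH_EMBED_PER_M_TOKENS \
    + (query_compute_seconds / 3600) * COMPUTE_PER_CPU_HOUR

print(f"ingest ${ingest_cost:.4f}  eval ${eval_cost:.4f}  total ${ingest_cost + eval_cost:.4f}")
# -> ingest $0.0168  eval $0.0026  total $0.0194
```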
Related TypeGraph Reading
- Set up semantic document retrieval in TypeGraph.
- Understand weights, scoring, and top-k retrieval.
- Ingest and manage documents in buckets.
- Search documents, events, threads, entities, and facts with semantic, BM25, graph, and recency weights.
- Configure embedding models for ingestion and query.
- How to design retrieval and answer-quality evaluations.
- How to read recall, precision, nDCG, MAP, and MRR.
- Compare against another MLEB contracts retrieval task.
- Review a larger legal retrieval run with TypeGraph content deduplication.