
License TL;DR Retrieval Benchmark

Published April 28, 2026 · Updated May 13, 2026

TypeGraph found 61 of 65 relevant software license summaries in the top 10 with 0.8066 nDCG@10, running the full semantic eval in 10.38 seconds.

What this page shows

This is a TypeGraph Cloud semantic retrieval run on the public License TL;DR Retrieval dataset from the Massive Legal Embedding Benchmark. Each query asks TypeGraph Cloud for the top 10 matching results, then the run is scored against the dataset's qrels with BEIR-style retrieval metrics.

The headline result: TypeGraph surfaced 61 of 65 relevant software license summaries in the top 10, scoring 0.8066 nDCG@10 and completing the full semantic eval in 10.38 seconds. The leaderboard table below keeps the official MLEB rows visible and inserts the TypeGraph Cloud run as a highlighted comparison row.

Semantic eval duration: 10.38s, 3.5x faster than the top official License TL;DR leaderboard row shown.

TypeGraph Cloud completed this bucket-scoped semantic retrieval eval in 10.38 seconds according to TypeGraph telemetry. The top official MLEB row shown, Voyage 4 Large, reports 36.84 seconds on the same dataset; treat this as leaderboard-reported runtime context rather than a controlled hardware benchmark.

nDCG@10: 0.8066 (ranking quality across all 65 scored queries)
Recall@10: 93.85% (61 of 65 gold license documents retrieved in the top 10)
Eval time: 10.38s (bucket-scoped semantic retrieval telemetry)
Ingest time: 3.76s (combined final seed/top-up indexing telemetry)
Corpus: 65 docs (115 indexed chunks after 2048-token chunking)
Cost: $0.0194 (metered ingest plus query execution cost)

Executive Summary

License TL;DR retrieval tests whether a RAG system can connect short natural-language license obligations to the correct software license document. The task is compact, but its results are easy to misread because each query has exactly one relevant document.

The TypeGraph run used semantic retrieval only, 2048-token chunks, a top-10 scoring cutoff, and BEIR-style ranking metrics. The important product signal is that 61 of 65 gold documents appeared in the first 10 scored results, with 0.8066 nDCG@10 over the full query set.

Precision@10 is included for completeness, but it should not be used as the headline metric for this dataset. Because there is only one gold document per query, the maximum possible P@10 is 0.10 even for a perfect run.

Benchmark Dataset

Queries are plain-language license summaries and obligations. The retriever must surface the one license document that matches each summary.

Property         Value
Dataset          License TL;DR Retrieval
Category         Contracts
Corpus           65 documents
Indexed chunks   115 chunks
Queries          65 queries
Qrels            1 relevant document per query
Chunking         2048 tokens, 256 overlap
Ingest time      3.76s indexing telemetry
Eval time        10.38s bucket retrieval
Ingest cost      $0.0168
Eval cost        $0.0026
Total cost       $0.0194

Ingest and eval timings are bucket-scoped TypeGraph telemetry windows, excluding client-side dataset loading and benchmark runner wait time.

Methodology

  1. Loaded the Isaacus License TL;DR corpus and qrels from the MLEB BEIR-style dataset.
  2. Seeded a TypeGraph Cloud benchmark bucket with graph extraction disabled and chunking set to 2048 tokens with 256 overlap.
  3. Ran documents-only semantic retrieval. BM25, graph, and recency weights were disabled for this run.
  4. Requested 12 retrieval candidates per query and kept the deduplicated top 10 for scoring (a minimal dedup sketch follows this list).
  5. Scored the retrieved corpus IDs against the source qrels using nDCG@10, MAP@10, Recall@10, and Precision@10.
  6. Calculated cost from TypeGraph metered events: ingest embedding tokens, search embedding tokens, and compute duration at the public metered rates.
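
Step 4's deduplication and cutoff are plain list handling rather than anything TypeGraph-specific. Below is a minimal sketch; the file name and helper are illustrative and assume results arrive ranked best-first.

dedupe-top10.ts
// Illustrative helper: keep the first occurrence of each corpus ID
// (results are assumed to arrive ranked best-first), then cut to the top 10.
function dedupedTopK(rankedCorpusIds: string[], k = 10): string[] {
  const seen = new Set<string>()
  const kept: string[] = []
  for (const id of rankedCorpusIds) {
    if (seen.has(id)) continue
    seen.add(id)
    kept.push(id)
    if (kept.length === k) break
  }
  return kept
}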

Detailed Metrics Overview

Before we dive into the leaderboard, here's a quick overview of the metrics, TypeGraph Cloud's scores, and how to read them:
Metric         TypeGraph score   How to read it
nDCG@10        0.806641          Primary ranking-quality metric used for MLEB leaderboard comparison.
MAP@10         0.764652          Mean average precision across all scored queries.
Recall@10      0.938462          61 of 65 relevant license documents appeared in the top 10.
Precision@10   0.093846          Near the dataset cap of 0.10 because each query has one relevant document.
Queries run    65                All queries were scored.
Weights        Semantic only     semantic=1; BM25, graph, and recency disabled.

How to read Precision@10 here

Precision@10 divides hits by 10 returned slots. License TL;DR has one gold document per query, so a perfect retrieval run would score 0.10 rather than 1.00 on P@10.
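
Concretely, this run retrieved the gold document for 61 of 65 queries, so mean P@10 works out to 61 / (65 × 10) ≈ 0.0938, against a ceiling of 65 / 650 = 0.10 for a perfect run.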

MLEB Leaderboard Comparison

Official rows come from Isaacus MLEB results.jsonl. The highlighted TypeGraph row is inserted for direct comparison using this page's semantic run.

Rank   Model                           Provider    Dims   nDCG@10   Eval time
1      Voyage 4 Large                  Voyage      1024   0.81420   36.84s
2      TypeGraph Cloud                 TypeGraph   1024   0.80664   10.38s
3      Voyage 4                        Voyage      1024   0.77602   40.05s
4      Qwen3 Embedding 4B              Qwen        2560   0.76430   12.92s
5      Kanon 2 Embedder                Isaacus     1792   0.74610   28.48s
6      Qwen3 Embedding 8B              Qwen        4096   0.73280   401.92s
7      Voyage 4 Lite                   Voyage      1024   0.71817   21.23s
8      Jina Embeddings v5 Text Small   Jina        1024   0.70985   4.58s
9      Gemini Embedding 001            Google      3072   0.69081   42.21s
10     Jina Embeddings v5 Text Nano    Jina        768    0.67571   1.13s
11     Text Embedding 3 Large          OpenAI      3072   0.66684   41.50s

Metered Cost

Ingest: $0.0168
Query: $0.0026
Total: $0.0194

Meter               Usage            Rate               Cost
Ingest embeddings   135,446 tokens   $0.12 / M tokens   $0.0163
Ingest compute      4.063s           $0.52 / CPU-hour   $0.0006
Search embeddings   3,660 tokens     $0.04 / M tokens   $0.0001
Query compute       16.720s          $0.52 / CPU-hour   $0.0024

Storage is excluded. Costs use public TypeGraph metered rates: ingest embeddings at $0.12/M tokens, search embeddings at $0.04/M tokens, and compute at $0.52/CPU-hour.

Relevant Code

Create a bucket and ingest the corpus

Download the public MLEB corpus from Hugging Face, save it locally, and ingest it into a TypeGraph bucket with the same chunking settings.

seed-mleb.ts
import { readFile } from 'node:fs/promises'
import { typegraphInit } from '@typegraph-ai/sdk'

type CorpusRow = {
  _id: string
  title?: string
  text: string
  metadata?: Record<string, unknown>
}

const typegraph = await typegraphInit({
  apiKey: process.env.TYPEGRAPH_API_KEY!,
  tenantId: process.env.TYPEGRAPH_TENANT_ID!,
})

const corpus = JSON.parse(
  await readFile('./data/license-tldr-retrieval/corpus.json', 'utf8'),
) as CorpusRow[]

const bucket = await typegraph.bucket.create({
  name: 'license-tldr-retrieval',
  indexDefaults: {
    chunkSize: 2048,
    chunkOverlap: 256,
    graphExtraction: false,
    deduplicateBy: ['content'],
  },
})

const documents = corpus.map((row) => ({
  id: row._id,
  name: row.title ?? row._id,
  content: row.text,
  metadata: {
    ...(row.metadata ?? {}),
    corpusId: row._id,
    documentName: row.title,
  },
}))

await typegraph.document.ingest(documents, { bucketId: bucket.id })

console.log(`Bucket ready: ${bucket.id}`)

Run semantic retrieval over the queries

Query the bucket with semantic search only and retain the top-10 corpus IDs for BEIR-style scoring.

eval-semantic.ts
import { readFile } from 'node:fs/promises'
import { typegraphInit } from '@typegraph-ai/sdk'

type QueryRow = { _id: string; text: string }

const typegraph = await typegraphInit({
  apiKey: process.env.TYPEGRAPH_API_KEY!,
  tenantId: process.env.TYPEGRAPH_TENANT_ID!,
})

const bucketId = process.env.TYPEGRAPH_BUCKET_ID!
const queries = JSON.parse(
  await readFile('./data/license-tldr-retrieval/queries.json', 'utf8'),
) as QueryRow[]

const run = new Map<string, string[]>()

for (const query of queries) {
  const response = await typegraph.search(query.text, {
    buckets: [bucketId],
    resources: ['documents'],
    weights: { semantic: 1, bm25: false, graph: false, recency: false },
    limit: 12,
  })

  run.set(
    query._id,
    response.results.chunks.map((chunk) =>
      String(chunk.metadata?.corpusId ?? chunk.document.name),
    ),
  )
}

// Score the deduplicated top 10 in `run` against qrels.json with nDCG@10, MAP@10, Recall@10, and P@10.
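
The scoring itself is standard BEIR-style math and does not touch TypeGraph. Below is a minimal sketch for the single-relevant-document case that this dataset's qrels describe; the file name and the queryId-to-corpusId qrels shape are illustrative assumptions, not part of the MLEB tooling.

score-run.ts
// Illustrative qrels shape: each query maps to its single gold corpus ID.
type Qrels = Record<string, string>

// With exactly one relevant document per query, the BEIR metrics reduce to:
// nDCG@10 = 1 / log2(rank + 1), AP@10 = 1 / rank, Recall@10 = 1 if found, P@10 = hits / 10.
function scoreRun(run: Map<string, string[]>, qrels: Qrels) {
  let ndcg = 0
  let map = 0
  let recall = 0
  let precision = 0
  const queryIds = Object.keys(qrels)

  for (const queryId of queryIds) {
    // Deduplicate while preserving rank order, then keep the top 10.
    const top10 = [...new Set(run.get(queryId) ?? [])].slice(0, 10)
    const rank = top10.indexOf(qrels[queryId]) + 1 // 1-based; 0 means the gold doc was missed

    if (rank > 0) {
      ndcg += 1 / Math.log2(rank + 1)
      map += 1 / rank
      recall += 1
      precision += 1 / 10
    }
  }

  const n = queryIds.length
  return { ndcg10: ndcg / n, map10: map / n, recall10: recall / n, precision10: precision / n }
}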

Minimal semantic query shape

For application code, the only TypeGraph-specific part is a semantic query against the benchmark bucket.

semantic-query.ts
const response = await typegraph.search(query.text, {
  buckets: [process.env.TYPEGRAPH_BUCKET_ID!],
  resources: ['documents'],
  weights: { semantic: 1, bm25: false, graph: false, recency: false },
  limit: 12,
})

const retrievedCorpusIds = response.results.chunks.map((chunk) =>
  String(chunk.metadata?.corpusId ?? chunk.document.name),
)
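
Requesting 12 candidates while scoring only 10 leaves headroom for deduplication: with 115 chunks covering 65 documents, two chunks from the same license can rank highly for one query, and the spare candidates let a distinct document still fill the tenth scored slot after duplicates collapse.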

Metered cost formula

The page reports metered run cost from TypeGraph event usage, not a synthetic estimate from document count.

cost.ts
const ingestCost =
  (embedIngestTokens / 1_000_000) * 0.12 +
  (ingestComputeMs / 3_600_000) * 0.52

const queryCost =
  (embedQueryTokens / 1_000_000) * 0.04 +
  (queryComputeMs / 3_600_000) * 0.52
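
As a sanity check, plugging this run's metered usage from the table above into the same formula reproduces the reported figures up to rounding; the file name is illustrative.

cost-check.ts
// This run's metered usage, taken from the cost table above (durations converted to ms).
const ingestCost =
  (135_446 / 1_000_000) * 0.12 + // ingest embedding tokens -> ~$0.0163
  (4_063 / 3_600_000) * 0.52 // 4.063s of ingest compute -> ~$0.0006

const queryCost =
  (3_660 / 1_000_000) * 0.04 + // search embedding tokens -> ~$0.0001
  (16_720 / 3_600_000) * 0.52 // 16.720s of query compute -> ~$0.0024

console.log(ingestCost.toFixed(4), queryCost.toFixed(4), (ingestCost + queryCost).toFixed(4))
// ~0.0168 ~0.0026 ~0.0194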
