License TL;DR Retrieval Benchmark
TypeGraph found 61 of 65 relevant software license documents in the top 10 with 0.8066 nDCG@10, running the full semantic eval in 10.38 seconds.
This is a TypeGraph Cloud semantic retrieval run on the public License TL;DR Retrieval dataset from the Massive Legal Embedding Benchmark (MLEB). Each query asks TypeGraph Cloud for the top 10 matching results, and the run is then scored against the dataset's qrels with BEIR-style retrieval metrics.
The leaderboard table below keeps the official MLEB rows visible and inserts the TypeGraph Cloud run as a highlighted comparison row.
TypeGraph Cloud completed this bucket-scoped semantic retrieval eval in 10.38 seconds according to TypeGraph telemetry. The top official MLEB row shown, Voyage 4 Large, reports 36.84 seconds on the same dataset; treat this as leaderboard-reported runtime context rather than a controlled hardware benchmark.
Executive Summary
License TL;DR retrieval tests whether a RAG system can connect short natural-language license obligations to the correct software license document. The task is compact, but it is easy to misread because each query has exactly one relevant document.
The TypeGraph run used semantic retrieval only, 2048-token chunks, a top-10 scoring cutoff, and BEIR-style ranking metrics. The important product signal is that 61 of 65 gold documents appeared in the first 10 scored results, with 0.8066 nDCG@10 over the full query set.
Precision@10 is included for completeness, but it should not be used as the headline metric for this dataset. Because there is only one gold document per query, the maximum possible P@10 is 0.10 even for a perfect run.
Benchmark Dataset
Queries are plain-language license summaries and obligations. The retriever must surface the one license document that matches each summary.
| Property | Value |
|---|---|
| Dataset | License TL;DR Retrieval |
| Category | Contracts |
| Corpus | 65 documents |
| Indexed chunks | 115 chunks |
| Queries | 65 queries |
| Qrels | 1 relevant document per query |
| Chunking | 2048 tokens, 256 overlap |
| Ingest time | 3.76s indexing telemetry |
| Eval time | 10.38s bucket retrieval |
| Ingest cost | $0.0168 |
| Eval cost | $0.0026 |
| Total cost | $0.0194 |
Ingest and eval timings are bucket-scoped TypeGraph telemetry windows, excluding client-side dataset loading and benchmark runner wait time.
Methodology
- Loaded the Isaacus License TL;DR corpus and qrels from the MLEB BEIR-style dataset.
- Seeded a TypeGraph Cloud benchmark bucket with graph extraction disabled and chunking set to 2048 tokens with 256 overlap.
- Ran documents-only semantic retrieval. BM25, graph, and recency weights were disabled for this run.
- Requested 12 retrieval candidates per query and scored the deduplicated top 10 against the source qrels.
- Scored the retrieved corpus IDs with nDCG@10, MAP@10, Recall@10, and Precision@10 (a scoring sketch follows this list).
- Calculated cost from TypeGraph metered events: ingest embedding tokens, search embedding tokens, and compute duration at the public metered rates.
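The scoring step in the bullets above is standard BEIR-style evaluation and can be reproduced with the open-source pytrec_eval library. The sketch below assumes `qrels` maps each query ID to `{corpus_id: relevance}` (loaded from the dataset) and `run` maps each query ID to the `{corpus_id: score}` pairs kept from retrieval; only the metric names are taken from this page.

```python
import pytrec_eval  # standard TREC-style evaluator used by BEIR tooling

# qrels: {query_id: {corpus_id: relevance}}   from the MLEB dataset
# run:   {query_id: {corpus_id: score}}       from the top-10 retrieval results
evaluator = pytrec_eval.RelevanceEvaluator(
    qrels, {"ndcg_cut.10", "map_cut.10", "recall.10", "P.10"}
)
per_query = evaluator.evaluate(run)

def mean(metric: str) -> float:
    # Average a per-query metric over all scored queries.
    return sum(scores[metric] for scores in per_query.values()) / len(per_query)

print("nDCG@10:  ", mean("ndcg_cut_10"))
print("MAP@10:   ", mean("map_cut_10"))
print("Recall@10:", mean("recall_10"))
print("P@10:     ", mean("P_10"))
```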
Detailed Metrics Overview
| Metric | TypeGraph Score | How to read it |
|---|---|---|
| nDCG@10 | 0.806641 | Primary ranking-quality metric used for MLEB leaderboard comparison. |
| MAP@10 | 0.764652 | Mean average precision across all scored queries. |
| Recall@10 | 0.938462 | 61 of 65 relevant license documents appeared in the top 10. |
| Precision@10 | 0.093846 | Near the dataset cap of 0.10 because each query has one relevant document. |
| Queries run | 65 | All queries were scored. |
| Weights | Semantic only | semantic=1; BM25, graph, and recency disabled. |
How to read Precision@10 here
Precision@10 divides hits by 10 returned slots. License TL;DR has one gold document per query, so a perfect retrieval run would score 0.10 rather than 1.00 on P@10.
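A quick arithmetic check makes the cap concrete. Averaging per-query precision over 65 queries with 10 scored slots each, the 61 hits reported on this page land exactly on the published figure:

```python
hits, num_queries, k = 61, 65, 10   # figures reported on this page

p_at_10 = hits / (num_queries * k)               # mean Precision@10 across all queries
best_possible = num_queries / (num_queries * k)  # one relevant document per query

print(p_at_10)        # 0.0938..., matching the reported 0.093846
print(best_possible)  # 0.10, the ceiling for this dataset
```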
MLEB Leaderboard Comparison
Official rows come from the Isaacus MLEB `results.jsonl`. The highlighted TypeGraph row is inserted for direct comparison using this page's semantic run.
| Rank | Model | Provider | Dims | nDCG@10 | Eval time |
|---|---|---|---|---|---|
| 1 | Voyage 4 Large | Voyage | 1024 | 0.81420 | 36.84s |
| 2 | TypeGraph Cloud | TypeGraph | 1024 | 0.80664 | 10.38s |
| 3 | Voyage 4 | Voyage | 1024 | 0.77602 | 40.05s |
| 4 | Qwen3 Embedding 4B | Qwen | 2560 | 0.76430 | 12.92s |
| 5 | Kanon 2 Embedder | Isaacus | 1792 | 0.74610 | 28.48s |
| 6 | Qwen3 Embedding 8B | Qwen | 4096 | 0.73280 | 401.92s |
| 7 | Voyage 4 Lite | Voyage | 1024 | 0.71817 | 21.23s |
| 8 | Jina Embeddings v5 Text Small | Jina | 1024 | 0.70985 | 4.58s |
| 9 | Gemini Embedding 001 | Google | 3072 | 0.69081 | 42.21s |
| 10 | Jina Embeddings v5 Text Nano | Jina | 768 | 0.67571 | 1.13s |
| 11 | Text Embedding 3 Large | OpenAI | 3072 | 0.66684 | 41.50s |
Metered Cost
| Meter | Usage | Rate | Cost |
|---|---|---|---|
| Ingest embeddings | 135,446 tokens | $0.12 / M tokens | $0.0163 |
| Ingest compute | 4.063s | $0.52 / CPU-hour | $0.0006 |
| Search embeddings | 3,660 tokens | $0.04 / M tokens | $0.0001 |
| Query compute | 16.720s | $0.52 / CPU-hour | $0.0024 |
Storage is excluded. Costs use public TypeGraph metered rates: ingest embeddings at $0.12/M tokens, search embeddings at $0.04/M tokens, and compute at $0.52/CPU-hour.
Relevant Code
Create a bucket and ingest the corpus
Download the public MLEB corpus from Hugging Face, save it locally, and ingest it into a TypeGraph bucket with the same chunking settings.
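This page does not pin down the TypeGraph Cloud SDK surface, so the sketch below is illustrative only: the `typegraph` package, `TypeGraphClient`, `create_bucket`, and `ingest_document` names are assumptions, as is the Hugging Face dataset ID; the chunking and graph settings mirror the run described above.

```python
# Sketch under stated assumptions: the TypeGraph client names and the
# Hugging Face dataset ID below are illustrative, not a documented API.
from datasets import load_dataset        # real Hugging Face `datasets` API
from typegraph import TypeGraphClient    # hypothetical TypeGraph Cloud SDK

DATASET_ID = "isaacus/mleb-license-tldr"  # assumed dataset ID on Hugging Face

# BEIR-style corpus split: one row per license document (_id, title, text).
corpus = load_dataset(DATASET_ID, "corpus", split="corpus")

client = TypeGraphClient(api_key="...")   # hypothetical constructor

# Benchmark bucket with the same settings as this run:
# graph extraction disabled, 2048-token chunks with 256-token overlap.
bucket = client.create_bucket(
    name="mleb-license-tldr",
    graph_extraction=False,
    chunk_tokens=2048,
    chunk_overlap=256,
)

for doc in corpus:
    bucket.ingest_document(
        external_id=doc["_id"],           # keep the BEIR corpus ID for qrels scoring
        title=doc.get("title", ""),
        text=doc["text"],
    )
```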
Run semantic retrieval over the queries
Query the bucket with semantic search only and retain the top-10 corpus IDs for BEIR-style scoring.
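Continuing from the ingest sketch above (same hypothetical `bucket` object and `DATASET_ID`), a minimal version of the retrieval loop might look like this; the `search` parameters and hit fields are assumptions that mirror the weights and candidate counts stated in the methodology.

```python
from datasets import load_dataset  # real Hugging Face `datasets` API

# Assumed BEIR-style queries split: one row per plain-language license summary.
queries = load_dataset(DATASET_ID, "queries", split="queries")

run = {}  # query_id -> {corpus_id: score}, the shape BEIR-style scorers expect
for q in queries:
    hits = bucket.search(                  # hypothetical semantic search call
        query=q["text"],
        limit=12,                          # 12 candidates requested per query
        weights={"semantic": 1.0, "bm25": 0.0, "graph": 0.0, "recency": 0.0},
    )
    top_docs = {}
    for hit in hits:
        doc_id = hit.document_external_id  # assumed field carrying the corpus ID
        if doc_id not in top_docs:
            top_docs[doc_id] = hit.score   # deduplicate chunks to document level
        if len(top_docs) == 10:
            break                          # score only the deduplicated top 10
    run[q["_id"]] = top_docs
```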
Minimal semantic query shape
For application code, the only TypeGraph-specific part is a semantic query against the benchmark bucket.
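A minimal sketch of that query, reusing the hypothetical client from the snippets above (method name, weight keys, and hit fields are all assumptions):

```python
# Hypothetical semantic-only query against the benchmark bucket.
hits = bucket.search(
    query="Can I sublicense this software if I keep the copyright notice?",  # example obligation-style query
    limit=10,
    weights={"semantic": 1.0},   # BM25, graph, and recency left disabled
)
for hit in hits:
    print(hit.score, hit.document_external_id)
```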
Metered cost formula
The page reports metered run cost from TypeGraph event usage, not a synthetic estimate from document count.
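As a sketch, the reported numbers reproduce directly from the usage figures and public rates in the Metered Cost table above; the variable names below are plain stand-ins for fields in TypeGraph's metered events, not a documented schema.

```python
# Usage from this run's metered events (see the Metered Cost table).
ingest_embed_tokens = 135_446
search_embed_tokens = 3_660
ingest_compute_seconds = 4.063
query_compute_seconds = 16.720

# Public TypeGraph metered rates quoted on this page.
INGEST_EMBED_PER_M_TOKENS = 0.12   # $ per million tokens
SEARCH_EMBED_PER_M_TOKENS = 0.04   # $ per million tokens
COMPUTE_PER_CPU_HOUR = 0.52        # $ per CPU-hour

ingest_cost = (ingest_embed_tokens / 1e6) * INGEST_EMBED_PER_M_TOKENS \
    + (ingest_compute_seconds / 3600) * COMPUTE_PER_CPU_HOUR
eval_cost = (search_embed_tokens / 1e6) * SEARCH_EMBED_PER_M_TOKENS \
    + (query_compute_seconds / 3600) * COMPUTE_PER_CPU_HOUR

print(f"ingest ${ingest_cost:.4f}  eval ${eval_cost:.4f}  total ${ingest_cost + eval_cost:.4f}")
# -> ingest $0.0168  eval $0.0026  total $0.0194
```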
Related TypeGraph Reading
- Set up semantic document retrieval in TypeGraph.
- Understand weights, scoring, and top-k retrieval.
- Ingest and manage documents in buckets.
- Search documents, events, threads, entities, and facts with semantic, BM25, graph, and recency weights.
- Configure embedding models for ingestion and query.
- How to design retrieval and answer-quality evaluations.
- How to read recall, precision, nDCG, MAP, and MRR.
- Compare against another MLEB contracts retrieval task.
- Review a larger legal retrieval run with TypeGraph content deduplication.