GraphRAGGraphRAG-Bench Medical

GraphRAG-Bench Medical Answer Accuracy

Published May 1, 2026 · Updated May 1, 2026

TypeGraph scored 0.6768 ACC on all 2,062 GraphRAG-Bench Medical questions with semantic, BM25, and graph retrieval; observed search latency was 290ms p50 and 1.39s p95.

What this page shows

This is a TypeGraph Cloud answer-quality run on GraphRAG-Bench Medical, a benchmark that tests generated answers over medical and healthcare source material rather than only checking whether retrieval returned a known document ID.

The run used semantic, BM25, and graph retrieval, passed the SDK-native markdown prompt directly into a single tuned answer prompt, and scored answers with the GraphRAG-Bench LLM-as-judge ACC calculation.

Query latency

290ms p50

1.39s p95 end-to-end TypeGraph query latency

Latency is measured across the benchmark retrieval requests. Answer generation and judge calls are not included in these TypeGraph query latency percentiles.

Average ACC

0.6768

Answer correctness across 2,061 scored GraphRAG-Bench Medical questions

Total Benchmark Time

1h 31m 02s

Total corpus ingest and eval retrieval wall-clock time for the benchmark run

Total TypeGraph Cost

$6.55

TypeGraph-metered ingest plus retrieval cost; answer generation and judge calls excluded

Executive Summary

GraphRAG-Bench Medical is split across direct fact retrieval, complex reasoning, contextual summarization, and creative generation questions. The overall score is answer correctness across the benchmark, not a retrieval-only metric.

TypeGraph scored 0.6768 ACC overall. The strongest categories were Fact Retrieval at 0.7250 ACC and Complex Reasoning at 0.7246 ACC, with Contextual Summarize close behind at 0.6656 ACC.

Creative Generation remains the hardest category in this run at 0.2312 ACC. The faithfulness score was 0.5068 and coverage was 0.4265, which suggests responses often stayed partially grounded but did not fully satisfy the requested creative form and evidence coverage.

Benchmark Dataset

The Medical split contains healthcare and guideline-style source material with generated questions that exercise fact lookup, multi-hop reasoning, summarization, and creative generation grounded in retrieved context.

Property	Value
Dataset	GraphRAG-Bench Medical
Category	Answer-quality GraphRAG benchmark
Corpus	249 source documents
Indexed chunks	Indexed with 512-token chunks and 64-token overlap
Queries	2,062 questions
Qrels	Gold answers and question-type labels
Chunking	512 tokens, 64 overlap
Ingest time	6m 36s

Ingest time covers corpus indexing, chunking, graph extraction, and retrieval index construction.

Methodology

Loaded the GraphRAG-Bench Medical queries and gold answers from the benchmark dataset.
Searched the indexed TypeGraph corpus with semantic, BM25, and graph weights enabled.
Requested SDK-native markdown context with chunk and fact sections and passed response.prompt directly into answer generation.
Generated answers with openai/gpt-4o-mini.
Scored answers with the GraphRAG-Bench ACC method: LLM-judged factuality plus embedding-based semantic similarity.

Detailed Metrics Overview

Before we dive into the leaderboard, here's a quick overview of the metrics, TypeGraph Cloud's scores, and how to read them:

Metric	TypeGraph Score	How to read it
Overall ACC	0.676777	Primary GraphRAG-Bench answer-quality score across 2,061 scored questions. Judges if the answer is factually equivalent to the gold answer.
Overall ROUGE-L	0.391737	Text overlap with the gold answer; useful but can underrate good paraphrases.
Fact Retrieval ACC	0.724956	1,097 direct fact questions. Did you return the correct specific medical fact, name, risk factor, treatment, or short answer?
Complex Reasoning ACC	0.724637	509 reasoning questions. Did you correctly chain multiple medical facts together and answer the conclusion?
Contextual Summarize ACC	0.665567	289 summarization questions with coverage judging. Does the response cover the requested clinical concepts and relationships without drifting?
Creative Generation ACC	0.231160	166 creative questions with faithfulness and coverage judging. Does the response stay faithful to the source while satisfying the requested creative form?

How to read GraphRAG-Bench ACC

GraphRAG-Bench ACC is a continuous answer-quality score from 0 to 1. It is not exact match and it is not a BEIR retrieval metric like nDCG@10. The benchmark decomposes generated and gold answers into statements, judges factual overlap, and blends that with semantic similarity.

GraphRAG-Bench Medical Leaderboard Comparison

Published comparison rows come from the official GraphRAG-Bench Medical leaderboard values. The highlighted TypeGraph row is inserted on the same percentage scale.

Rank	System	Avg ACC	Fact Retrieval		Complex Reasoning		Contextual Summarize		Creative Generation
Rank	System	Avg ACC	ACC	ROUGE-L	ACC	ROUGE-L	ACC	Cov	ACC	FS	Cov
1	G-reasoner	73.30%	68.84	44.73	75.17	29.10	77.23	60.64	72.04	53.65	48.31
2	TypeGraph Cloud	67.68%	72.50	46.10	72.46	32.67	66.56	59.70	23.12	50.68	42.65
3	AutoPrunedRetriever-llm	67.00%	61.25	34.69	71.59	31.11	70.14	40.59	65.02	33.06	28.62
4	HippoRAG2	64.85%	66.28	36.69	61.98	36.97	63.08	46.13	68.05	58.78	51.54
5	Fast-GraphRAG	64.12%	60.93	31.04	61.73	21.37	67.88	52.07	65.93	56.07	44.73
6	LightRAG	62.59%	63.32	37.19	61.32	24.98	63.14	51.16	67.91	78.76	51.58
7	RAG (w rerank)	62.43%	64.73	30.75	58.64	15.57	65.75	78.54	60.61	36.74	58.72
8	RAG (w/o rerank)	61.00%	63.72	29.21	57.61	13.98	63.72	77.34	58.94	35.88	57.87
9	HippoRAG	59.08%	56.14	20.95	55.87	13.57	59.86	62.73	64.43	69.21	65.56
10	StructRAG	58.56%	55.38	27.53	56.17	22.79	62.48	65.66	60.21	42.35	45.76
11	RAPTOR	57.10%	54.07	17.93	53.20	11.73	58.73	78.28	62.38	58.98	63.63
12	Lazy-GraphRAG	56.89%	60.25	31.66	47.82	22.68	57.28	55.92	62.22	30.95	43.79
13	KGP	56.33%	55.53	21.34	51.53	11.69	54.51	62.40	63.77	45.25	35.55
14	KET-RAG	47.05%	60.35	31.99	39.56	19.52	45.27	29.04	43.04	33.67	31.93
15	MS-GraphRAG (local)	45.16%	38.63	26.80	47.04	21.99	41.87	22.98	53.11	32.65	39.42
16	MS-GraphRAG (global)	28.56%	16.42	46.00	15.61	52.75	19.82	-	20.81	-	13.64

Graph Footprint

Metric	Value	How to read it
Documents	249	Source documents indexed for the benchmark.
Document groups	1	One corpus principal for the Medical corpus, used to scope benchmark queries.
Chunks / passage nodes	745	Indexed chunks at 512-token chunking with 64-token overlap; graph passage nodes mirror the chunks.
Semantic entities / graph nodes	1,262	Resolved graph entities extracted from the medical corpus.
Semantic edges	739	Stored relationship edges between semantic entities.
Fact records	739	Evidence-backed fact records used by graph retrieval and answer context assembly.
Entity chunk mentions	9,649	Entity mention rows linking extracted entities back to chunks.
Passage entity edges	4,019	Edges between passage nodes and entities for graph-anchored retrieval.

Metered Cost

Ingest

$6.48

Eval

$0.074

Total

$6.55

Meter	Usage	Rate	Cost
Ingest embeddings	377,143 tokens	$0.12 / M tokens	$0.0453
Ingest LLM input	5,168,132 tokens	$1.00 / M tokens	$5.17
Ingest LLM output	234,583 tokens	$3.00 / M tokens	$0.70
Ingest compute	3,868,786 ms	$0.52 / CPU-hour	$0.56
Eval search embeddings	35,251 tokens	$0.04 / M tokens	$0.0014
Eval retrieval compute	502,611 ms	$0.52 / CPU-hour	$0.0726

Storage, answer generation, and judge calls are excluded. Costs use TypeGraph metered usage only: ingest embeddings at $0.12/M tokens, search embeddings at $0.04/M tokens, LLM input at $1.00/M tokens, LLM output at $3.00/M tokens, and compute at $0.52/CPU-hour.

Relevant Code

Create a graph-enabled bucket

Create a bucket with stable chunking and graph extraction enabled. Tenant isolation comes from the client tenantId; benchmark corpus separation is handled by bucket and graph selection.

import { typegraphInit } from '@typegraph-ai/sdk'

const typegraph = await typegraphInit({
  apiKey: process.env.TYPEGRAPH_API_KEY!,
  tenantId: process.env.TYPEGRAPH_TENANT_ID!,
})

const bucket = await typegraph.bucket.create({
  name: 'graphrag-bench-medical',
  indexDefaults: {
    chunkSize: 512,
    chunkOverlap: 64,
    graphExtraction: true,
    deduplicateBy: ['content'],
  },
})

console.log(bucket.id)

Ingest the medical corpus

Write the medical documents with stable corpus metadata so benchmark queries can read from the same corpus as the gold answer.

import { readFile } from 'node:fs/promises'
import { typegraphInit } from '@typegraph-ai/sdk'

type MedicalDocument = {
  id: string
  name: string
  corpus?: string
  text: string
  url?: string
  metadata?: Record<string, unknown>
}

const typegraph = await typegraphInit({
  apiKey: process.env.TYPEGRAPH_API_KEY!,
  tenantId: process.env.TYPEGRAPH_TENANT_ID!,
})

const bucketId = process.env.TYPEGRAPH_BUCKET_ID!
const corpus = JSON.parse(await readFile('./medical-corpus.json', 'utf8')) as MedicalDocument[]

await typegraph.document.ingest(
  corpus.map((documentRow) => ({
    id: documentRow.id,
    name: documentRow.name,
    content: documentRow.text,
    url: documentRow.url,
    metadata: {
      ...(documentRow.metadata ?? {}),
      corpus: documentRow.corpus ?? 'Medical',
      stableDocumentId: documentRow.id,
    },
  })),
  {
    bucketId,
    graphExtraction: true,
    chunkSize: 512,
    chunkOverlap: 64,
  },
)

Run a corpus-scoped graph search

For each benchmark question, search the medical benchmark bucket/graph and pass the SDK-built markdown prompt downstream unchanged.

const response = await typegraph.search(question.text, {
  buckets: [process.env.TYPEGRAPH_BUCKET_ID!],
  graph: process.env.TYPEGRAPH_GRAPH_ID ?? 'public',
  resources: ['documents', 'facts', 'entities'],
  limit: 12,
  weights: {
    semantic: 1,
    bm25: 0.7,
    graph: 0.5,
    recency: 0.3,
  },
  promptBuilder: {
    format: 'markdown',
    sections: ['chunks', 'facts'],
    includeAttributes: false,
  },
})

// Use response.prompt as the full answer-generation context.
console.log(response.prompt)

Evaluation loop outline

The public pieces are corpus-scoped retrieval, SDK-built prompts, answer generation, and JSONL result logging. Use the official GraphRAG-Bench scorer or your own judge for final metrics.

import { appendFile, readFile } from 'node:fs/promises'
import { typegraphInit } from '@typegraph-ai/sdk'

type Question = {
  id: string
  corpus?: string
  text: string
  questionType: string
}

const typegraph = await typegraphInit({
  apiKey: process.env.TYPEGRAPH_API_KEY!,
  tenantId: process.env.TYPEGRAPH_TENANT_ID!,
})

const questions = JSON.parse(await readFile('./medical-queries.json', 'utf8')) as Question[]

for (const question of questions) {
  const response = await typegraph.search(question.text, {
    buckets: [process.env.TYPEGRAPH_BUCKET_ID!],
    graph: process.env.TYPEGRAPH_GRAPH_ID ?? 'public',
    resources: ['documents', 'facts', 'entities'],
    limit: 12,
    weights: { semantic: 1, bm25: 0.7, graph: 0.5 },
    promptBuilder: { format: 'markdown', sections: ['chunks', 'facts'] },
  })

  const answer = await generateAnswer({
    question: question.text,
    context: response.prompt,
  })

  await appendFile(
    './results.jsonl',
    JSON.stringify({
      id: question.id,
      corpus: question.corpus ?? 'Medical',
      questionType: question.questionType,
      answer,
      prompt: response.prompt,
      retrieval: response.results,
    }) + '\n',
  )
}

Answer generation prompt used

Use one concise, context-grounded prompt across all question types and pass the retrieved context exactly as returned by the SDK.

---Role---
You are a helpful assistant responding to user queries.

---Goal---
Generate a direct, concise answer based strictly on the provided Context.
Answer only what the Question asks. Do not restate the Question, explain your reasoning, or add background details.
Use one sentence when possible. For multi-part or creative requests, use the shortest complete answer that satisfies the Question.
If asked to summarize, summarize the relationships, effect, contrast, or implication asked about, not the whole passage.
Stay grounded in the Context and avoid unsupported specifics.
If the Context contains partial relevant evidence, synthesize the supported answer instead of refusing.
Respond in plain text without formatting.
Use the same language as the Question.
Default to 5-20 words; exceed 25 words only when the Question explicitly asks for a summary, comparison, explanation, or creative response.
If the answer can be expressed as a name, list, date, place, relation, or short clause, output only that.

References

GraphRAG-Bench leaderboard

Published benchmark site and leaderboard for GraphRAG-Bench Medical.

GraphRAG-Bench dataset

Hugging Face dataset card for the Novel and Medical splits.

GraphRAG-Bench paper

Paper describing the benchmark tasks and scoring methodology.