GraphRAGGraphRAG-Bench Novel

GraphRAG-Bench Novel Answer Accuracy

Published May 1, 2026 · Updated May 1, 2026

TypeGraph scored 0.6265 ACC on all 2,010 GraphRAG-Bench Novel questions with semantic, BM25, and graph retrieval; observed search latency was 794ms p50 and 1.69s p95.

What this page shows

This is a TypeGraph Cloud answer-quality run on GraphRAG-Bench Novel, a benchmark that tests generated answers over public-domain books rather than only checking whether retrieval returned a known document ID.

The run used semantic, BM25, and graph retrieval, passed the SDK-native markdown prompt directly into a single tuned answer prompt, and scored answers with the GraphRAG-Bench LLM-as-judge ACC calculation.

Query latency

794ms p50

1.69s p95 end-to-end TypeGraph query latency

Latency is measured across the benchmark retrieval requests. Answer generation and judge calls are not included in these TypeGraph query latency percentiles.

Average ACC

0.6265

Answer correctness across all 2,010 GraphRAG-Bench Novel questions

Total Benchmark Time

64m 48s

Total ingest and eval time, including query and judge (running 20 evals concurrently)

Total TypeGraph Cost

$34.57

TypeGraph-metered ingest plus retrieval cost; answer generation and judge calls excluded

Executive Summary

GraphRAG-Bench Novel is split across direct fact retrieval, complex reasoning, contextual summarization, and creative generation questions. The overall score is the average answer correctness across all 2,010 questions, not a retrieval-only metric.

TypeGraph scored 0.6265 ACC overall. The strongest category was Contextual Summarize at 0.6446 ACC with 0.8482 coverage, followed closely by Fact Retrieval at 0.6351 ACC and Complex Reasoning at 0.6263 ACC.

Creative Generation remains the hardest category in this run at 0.4072 ACC. That category also exposes a different tradeoff: faithfulness was 0.6212, while coverage was 0.4047, suggesting the generated responses tended to stay grounded but often missed some requested creative or contextual elements.

Benchmark Dataset

The Novel split contains public-domain book passages and generated questions that exercise fact lookup, multi-hop reasoning, summarization, and creative generation grounded in retrieved context.

Property	Value
Dataset	GraphRAG-Bench Novel
Category	Answer-quality GraphRAG benchmark
Corpus	1,147 source documents
Indexed chunks	Indexed with 512-token chunks and 64-token overlap
Queries	2,010 questions
Qrels	Gold answers and question-type labels
Chunking	512 tokens, 64 overlap
Ingest time	46m 18s

Ingest time covers corpus indexing, chunking, graph extraction, and retrieval index construction.

Methodology

Loaded the GraphRAG-Bench Novel queries and gold answers from the benchmark dataset.
Searched the indexed TypeGraph corpus with semantic, BM25, and graph weights enabled.
Requested SDK-native markdown context with chunk and fact sections and passed response.prompt directly into answer generation.
Generated answers with openai/gpt-4o-mini.
Scored answers with the GraphRAG-Bench ACC method: LLM-judged factuality plus embedding-based semantic similarity.

Detailed Metrics Overview

Before we dive into the leaderboard, here's a quick overview of the metrics, TypeGraph Cloud's scores, and how to read them:

Metric	TypeGraph Score	How to read it
Overall ACC	0.626541	Primary GraphRAG-Bench answer-quality score across all 2,010 questions. Judges if the answer is factually equivalent to the gold answer.
Overall ROUGE-L	0.377493	Text overlap with the gold answer; useful but can underrate good paraphrases.
Fact Retrieval ACC	0.635099	971 direct fact questions. Did you return the correct specific fact? Who killed X? In what year did Y happen? Easy to judge: the answer is a name, date, or short phrase.
Complex Reasoning ACC	0.626275	610 reasoning questions. Did you correctly chain multiple facts together? Why did X betray Y? Judge checks the conclusion and often the linking steps.
Contextual Summarize ACC	0.644626	362 summarization questions with coverage judging. Does your summary correctly cover the requested entities and relationships? Judge checks whether the key facts are present and accurate.
Creative Generation ACC	0.407238	67 creative questions with faithfulness and coverage judging. Does the creative output stay faithful to the source while fulfilling the creative ask? Judge checks both grounding (no hallucinated facts) and form (did you actually write a scene, not a one-liner).

How to read GraphRAG-Bench ACC

GraphRAG-Bench ACC is a continuous answer-quality score from 0 to 1. It is not exact match and it is not a BEIR retrieval metric like nDCG@10. The benchmark decomposes generated and gold answers into statements, judges factual overlap, and blends that with semantic similarity.

GraphRAG-Bench Novel Leaderboard Comparison

Published comparison rows come from the GraphRAG-Bench Novel leaderboard values. The highlighted TypeGraph row uses the same percentage scale.

Rank	System	Avg ACC	Fact Retrieval		Complex Reasoning		Contextual Summarize		Creative Generation
Rank	System	Avg ACC	ACC	ROUGE-L	ACC	ROUGE-L	ACC	Cov	ACC	FS	Cov
1	AutoPrunedRetriever-llm	63.72%	45.99	26.99	62.80	35.35	83.10	83.86	62.97	34.40	22.13
2	TypeGraph Cloud	62.65%	63.51	45.49	62.63	30.46	64.46	84.82	40.72	62.12	40.47
3	G-reasoner	58.94%	60.07	36.93	53.92	23.00	71.28	55.60	50.48	54.24	45.44
4	HippoRAG2	56.48%	60.14	31.35	53.38	33.42	64.10	70.84	48.28	49.84	30.95
5	Fast-GraphRAG	52.02%	56.95	35.90	48.55	21.12	56.41	80.82	46.18	57.19	36.99
6	MS-GraphRAG (local)	50.93%	49.29	26.11	50.93	24.09	64.40	75.58	39.10	55.44	35.65
7	Lazy-GraphRAG	50.59%	51.65	36.97	49.22	23.48	58.29	76.94	43.23	50.69	39.74
8	StructRAG	49.13%	53.84	26.73	46.27	23.49	54.28	63.56	42.16	52.68	36.75
9	RAG (w rerank)	48.35%	60.92	36.08	42.93	15.39	51.30	83.64	38.26	49.21	40.04
10	KGP	48.01%	54.15	24.73	46.31	16.91	51.21	64.34	40.37	52.55	34.65
11	RAG (w/o rerank)	47.93%	58.76	37.35	41.35	15.12	50.08	82.53	41.52	47.46	37.84
12	KET-RAG	47.62%	55.39	27.39	36.59	25.98	52.47	69.24	46.03	36.72	33.68
13	LightRAG	45.09%	58.62	35.72	49.07	24.16	48.85	63.05	23.80	57.28	25.01
14	HippoRAG	44.75%	52.93	26.65	38.52	11.16	48.70	85.55	38.85	71.53	38.97
15	MS-GraphRAG (global)	44.52%	36.92	17.32	43.17	15.12	56.87	80.55	41.11	75.15	30.34
16	RAPTOR	43.24%	49.25	23.74	38.59	11.66	47.10	82.33	38.01	70.85	35.88

Graph Footprint

Metric	Value	How to read it
Documents	1,147	Source documents indexed for the benchmark.
Document groups	20	One group per source novel, used to scope each benchmark query.
Chunks / passage nodes	3,416	Indexed chunks at 512-token chunking with 64-token overlap; graph passage nodes mirror the chunks.
Semantic entities / graph nodes	10,793	Resolved graph entities extracted from the novel corpus.
Semantic edges	11,652	Stored relationship edges between semantic entities.
Entity chunk mentions	53,401	Entity mention rows linking extracted entities back to chunks.
Passage entity edges	25,211	Edges between passage nodes and entities for graph-anchored retrieval.

Metered Cost

Ingest

$34.50

Eval

$0.067

Total

$34.57

Meter	Usage	Rate	Cost
Ingest embeddings	2,431,064 tokens	$0.12 / M tokens	$0.29
Ingest LLM input	24,665,606 tokens	$1.00 / M tokens	$24.66
Ingest LLM output	1,884,015 tokens	$3.00 / M tokens	$5.65
Ingest compute	26,987,113 ms	$0.52 / CPU-hour	$3.89
Eval search embeddings	60,678 tokens	$0.04 / M tokens	$0.0024
Eval retrieval compute	444,918 ms	$0.52 / CPU-hour	$0.0643

Storage, answer generation, and judge calls are excluded. Costs use TypeGraph metered usage only: ingest embeddings at $0.12/M tokens, search embeddings at $0.04/M tokens, LLM input at $1.00/M tokens, LLM output at $3.00/M tokens, and compute at $0.52/CPU-hour.

Relevant Code

Create a graph-enabled bucket

Create a bucket with stable chunking, graph extraction enabled, and explicit embedding settings. Tenant isolation comes from the client tenantId; benchmark corpus separation is handled by bucket and graph selection.

import { typegraphInit } from '@typegraph-ai/sdk'

const typegraph = await typegraphInit({
  apiKey: process.env.TYPEGRAPH_API_KEY!,
  tenantId: process.env.TYPEGRAPH_TENANT_ID!,
})

const bucket = await typegraph.bucket.create({
  name: 'graphrag-bench-novel',
  indexDefaults: {
    chunkSize: 512,
    chunkOverlap: 64,
    graphExtraction: true,
    deduplicateBy: ['content'],
  }
})

console.log(bucket.id)

Ingest documents by corpus group

Use document metadata and bucket/graph selection so benchmark queries can retrieve from the same corpus as the gold answer. In our test, we ingested concurrent batches of 300 documents because TypeGraph has an upper limit of 3 MB payloads for ingestion batches.

import { readFile } from 'node:fs/promises'
import { typegraphInit } from '@typegraph-ai/sdk'

type NovelDocument = {
  id: string
  name: string
  corpus: string
  text: string
  url?: string
  metadata?: Record<string, unknown>
}

const typegraph = await typegraphInit({
  apiKey: process.env.TYPEGRAPH_API_KEY!,
  tenantId: process.env.TYPEGRAPH_TENANT_ID!,
})

const bucketId = process.env.TYPEGRAPH_BUCKET_ID!
const corpus = JSON.parse(await readFile('./novels.json', 'utf8')) as NovelDocument[]

const byCorpus = new Map<string, NovelDocument[]>()

for (const documentRow of corpus) {
  byCorpus.set(documentRow.corpus, [...(byCorpus.get(documentRow.corpus) ?? []), documentRow])
}

for (const [corpusId, documents] of byCorpus) {
  await typegraph.document.ingest(
    documents.map((documentRow) => ({
      id: documentRow.id,
      name: documentRow.name,
      content: documentRow.text,
      url: documentRow.url,
      metadata: {
        ...(documentRow.metadata ?? {}),
        corpus: corpusId,
        stableDocumentId: documentRow.id,
      },
    })),
    {
      bucketId,
      graphExtraction: true,
      chunkSize: 512,
      chunkOverlap: 64,
    },
  )
}

Run a corpus-scoped graph search

For each benchmark question, search the benchmark bucket/graph and pass the SDK-built markdown prompt downstream unchanged.

const response = await typegraph.search(question.text, {
  buckets: [process.env.TYPEGRAPH_BUCKET_ID!],
  graph: process.env.TYPEGRAPH_GRAPH_ID ?? 'public',
  resources: ['documents', 'facts', 'entities'],
  limit: 12,
  weights: {
    semantic: 1,
    bm25: 0.7,
    graph: 0.5,
    recency: 0.3,
  },
  promptBuilder: {
    format: 'markdown',
    sections: ['chunks', 'facts'],
    includeAttributes: false,
  },
})

// Use response.prompt as the full answer-generation context.
console.log(response.prompt)

Evaluation loop outline

The public pieces are corpus-scoped retrieval, SDK-built prompts, answer generation, and JSONL result logging. Use the paper scorer or your own judge for final metrics. In our test, we ran 20 concurrent evals (eval = search+judge).

import { appendFile, readFile } from 'node:fs/promises'
import { typegraphInit } from '@typegraph-ai/sdk'

type Question = {
  id: string
  corpus: string
  text: string
  questionType: string
}

const typegraph = await typegraphInit({
  apiKey: process.env.TYPEGRAPH_API_KEY!,
  tenantId: process.env.TYPEGRAPH_TENANT_ID!,
})

const questions = JSON.parse(await readFile('./queries.json', 'utf8')) as Question[]

for (const question of questions) {
  const response = await typegraph.search(question.text, {
    buckets: [process.env.TYPEGRAPH_BUCKET_ID!],
    graph: process.env.TYPEGRAPH_GRAPH_ID ?? 'public',
    resources: ['documents', 'facts', 'entities'],
    limit: 12,
    weights: { semantic: 1, bm25: 0.7, graph: 0.5 },
    promptBuilder: { format: 'markdown', sections: ['chunks', 'facts'] },
  })

  const answer = await generateAnswer({
    question: question.text,
    context: response.prompt,
  })

  await appendFile(
    './results.jsonl',
    JSON.stringify({
      id: question.id,
      corpus: question.corpus,
      questionType: question.questionType,
      answer,
      prompt: response.prompt,
      retrieval: response.results,
    }) + '\n',
  )
}

Answer generation prompt used

You can see the answer generation prompt that was used in the benchmark runner. One important thing to note, is that in our benchmark we used gpt-4o-mini to generate the answers. In the original benchmark paper, they used qwen2.5-14b-instruct. One impact of this, is that gpt-4o-mini is faster to generate answers than qwen2.5-14b-instruct, and we were able to run 20 evals concurrently in the same time window. However, gpt4o-mini tends to be more verbose and creative, which hurts ACC scoring as it adds prose and context that is irrelevant.

---Role---
You are a helpful assistant responding to user queries.

---Goal---
Generate a direct, concise answer based strictly on the provided Context.
Answer only what the Question asks. Do not restate the Question, explain your reasoning, or add background details.
Use one sentence when possible. For multi-part or creative requests, use the shortest complete answer that satisfies the Question.
If asked to summarize, summarize the relationships, effect, contrast, or implication asked about, not the whole passage.
Stay grounded in the Context and avoid unsupported specifics.
If the Context contains partial relevant evidence, synthesize the supported answer instead of refusing.
When the question asks for a chain, connection, relation, sequence, or how X links to Y, answer with only the explicit named relation chain from the context; do not explain themes, causes, motives, or background unless asked.
Respond in plain text without formatting.
Use the same language as the Question.
No markdown headings, no placeholder dates, do not refuse historical/fictional perspective tasks.
Default to 5-20 words; exceed 25 words only when the Question explicitly asks for a summary, comparison, explanation, or creative response.
If the answer can be expressed as a name, list, date, place, relation, or short clause, output only that.

References

GraphRAG-Bench leaderboard

Published benchmark site and leaderboard for GraphRAG-Bench.

GraphRAG-Bench paper

Paper describing the benchmark tasks and scoring methodology.