Multi-Hop Reasoning Over Knowledge Graphs in RAG: When One Retrieval Step Isn't Enough
The question that breaks every standard RAG pipeline
Consider this query from a pharmaceutical compliance team: "Which suppliers of Acme Pharma have had FDA warning letters in the past 24 months, and which of our products depend on those suppliers?"
Answering this requires chaining three distinct facts: (1) Acme Pharma's supplier relationships, (2) FDA enforcement actions against those suppliers, and (3) product-to-supplier dependency mappings. These facts live in different documents - procurement contracts, FDA databases, and bill-of-materials records. No single chunk contains the complete answer. No single embedding captures all three relationships.
Standard RAG retrieves the top-k chunks most similar to the query embedding and passes them to the LLM. This works beautifully for questions where the answer lives in one place. It fails systematically for questions that require compositional reasoning - connecting information across multiple documents through a chain of relationships. These are called multi-hop queries, and they are precisely the queries that matter most in enterprise knowledge work.
Why vector similarity cannot solve multi-hop retrieval
The fundamental limitation is geometric. When you embed the query "Which suppliers of Acme Pharma have had FDA warning letters?", the resulting vector occupies a region of embedding space that is semantically close to documents about Acme Pharma, documents about FDA warning letters, and documents about supplier relationships. But the specific document you need - the one that says "Supplier Z received FDA warning letter FDA-2025-0847" - may not mention Acme Pharma at all. Its embedding is close to the FDA-enforcement region but far from the Acme-Pharma region. It will not rank in the top-k.
You could try to brute-force this by increasing k to retrieve more chunks, hoping the relevant ones appear somewhere in the larger set. But increasing k introduces noise faster than signal. At k=5, most chunks are relevant. At k=50, maybe 10 are relevant and 40 are distractors. The LLM now has to perform its own relevance filtering in the context window, with its attention spread across mostly irrelevant content - a problem documented in the "Lost in the Middle" research by Liu et al., which shows that LLMs struggle to use information placed in the middle of long contexts.
The solution is not to retrieve more chunks from the same flat index. It is to retrieve differently - using the structure of relationships between entities to guide the retrieval process across multiple hops.
Graph traversal as a retrieval signal
A knowledge graph encodes entities and their relationships as nodes and edges. "Acme Pharma" is a node. "Supplier Z" is a node. The edge between them says "supplies_to." "FDA Warning Letter FDA-2025-0847" is a node connected to Supplier Z by an "enforcement_action" edge. "Product Alpha" is a node connected to Supplier Z by a "depends_on" edge.
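As a concrete sketch, here is one way to represent this subgraph in Python using networkx. The node names and edge types mirror the example above; the `chunks` attributes tying nodes and edges back to source passages, the date value, and the chunk IDs are assumed for illustration, not a prescribed schema.

```python
import networkx as nx

# Entities are nodes; relationships are typed edges. Both carry the IDs
# of the source chunks they were extracted from (assumed provenance scheme).
G = nx.MultiDiGraph()

G.add_node("Acme Pharma", chunks=["contract-001"])
G.add_node("Supplier Z", chunks=["contract-001"])
G.add_node("FDA-2025-0847", chunks=["fda-0847"], date="2025-03-14")
G.add_node("Product Alpha", chunks=["bom-alpha"])

# Modeling choice: edges point in the direction the traversal below follows.
G.add_edge("Acme Pharma", "Supplier Z", type="supplies_to", chunks=["contract-001"])
G.add_edge("Supplier Z", "FDA-2025-0847", type="enforcement_action", chunks=["fda-0847"])
G.add_edge("Supplier Z", "Product Alpha", type="depends_on", chunks=["bom-alpha"])
```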
With this structure, answering the multi-hop query becomes a graph traversal problem:
- Hop 1: Start at the "Acme Pharma" node. Traverse all "supplies_to" edges to find supplier nodes. Result: {Supplier Z, Supplier W, Supplier Y, ...}.
- Hop 2: From each supplier node, traverse "enforcement_action" edges to find FDA-related nodes with dates within the past 24 months. Result: {FDA-2025-0847 (Supplier Z), FDA-2025-1203 (Supplier W)}.
- Hop 3: From the flagged suppliers (Z and W), traverse "depends_on" edges back to product nodes. Result: {Product Alpha, Product Gamma}.
At each hop, you collect the source chunks associated with the traversed nodes and edges. These chunks - the procurement contract mentioning Supplier Z, the FDA database entry for warning letter FDA-2025-0847, the BOM listing Product Alpha's dependencies - form the evidence set that gets passed to the LLM. The result is a precisely targeted context that contains exactly the information needed to answer the query, with no wasted tokens on irrelevant chunks.
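Continuing the sketch above, the three hops reduce to typed edge expansions over that graph. The `follow` helper and the string-comparison date filter are simplifications for illustration:

```python
def follow(G, nodes, edge_type):
    """Expand nodes along outgoing edges of one type; collect evidence chunks."""
    reached, chunks = set(), []
    for u in nodes:
        for _, v, data in G.out_edges(u, data=True):
            if data.get("type") == edge_type:
                reached.add(v)
                chunks += data.get("chunks", []) + G.nodes[v].get("chunks", [])
    return reached, chunks

seeds = {"Acme Pharma"}
suppliers, c1 = follow(G, seeds, "supplies_to")            # hop 1
actions, c2 = follow(G, suppliers, "enforcement_action")   # hop 2

# Keep only suppliers tied to an action inside the 24-month window
# (cutoff date is an assumed placeholder).
recent = {a for a in actions if G.nodes[a].get("date", "") >= "2024-06-01"}
flagged = {u for a in recent for u, _ in G.in_edges(a)}

products, c3 = follow(G, flagged, "depends_on")            # hop 3
evidence = c1 + c2 + c3   # chunks in traversal order, ready for the LLM
```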
Seeding the traversal: entity extraction from the query
The first step in multi-hop retrieval is identifying the seed entities in the user's query. These are the starting nodes for graph traversal. In the example above, "Acme Pharma" is the obvious seed. But many queries contain implicit entities or require entity disambiguation before traversal can begin.
Consider: "What are the environmental compliance risks for our Southeast Asian operations?" The seed entities are not named explicitly - you need to resolve "our" to the user's organization, "Southeast Asian" to a set of geographic entities, and "operations" to a set of facility or subsidiary entities in that region. This requires a combination of entity extraction (NER on the query), coreference resolution (what does "our" refer to?), and entity linking (matching extracted mentions to nodes in the graph).
For reliable seed extraction, use a two-stage approach. First, run a lightweight NER model or LLM-based extraction to identify candidate entities in the query. Second, link each candidate to the graph using the entity resolution techniques - embedding similarity, alias lookup, and fuzzy matching - described in detail in our post on entity resolution in RAG pipelines. If a candidate cannot be linked to any graph node with confidence above a threshold, fall back to vector search for that component of the query.
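A minimal sketch of that two-stage flow. `extract_entities`, `link_entity`, and `vector_search` are hypothetical stand-ins for your NER model, entity resolution layer, and chunk index, and the 0.8 threshold is an arbitrary placeholder:

```python
def seed_entities(query, threshold=0.8):
    """Stage 1: extract candidate mentions. Stage 2: link them to graph nodes.
    Candidates that cannot be linked confidently fall back to vector search."""
    seeds, fallback_chunks = [], []
    for mention in extract_entities(query):        # NER or LLM-based extraction
        node, confidence = link_entity(mention)    # alias lookup + embedding match
        if node is not None and confidence >= threshold:
            seeds.append(node)
        else:
            # Unlinkable query component: retrieve for it with vector search.
            fallback_chunks += vector_search(mention, k=5)
    return seeds, fallback_chunks
```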
Single-hop, fixed-depth, and adaptive traversal: a comparison
Not all multi-hop queries require the same traversal strategy. The choice of strategy significantly affects both retrieval quality and latency.
- Single-hop retrieval traverses only the immediate neighbors of the seed entities. It answers questions like "Who are Acme Pharma's suppliers?" but cannot answer "Which of those suppliers have had FDA warning letters?" because that requires a second hop. Single-hop is fast (O(degree) per seed entity) and appropriate when the query asks about direct relationships.
- Fixed-depth traversal traverses to a predetermined depth - typically 2 or 3 hops. Every seed entity is expanded to the same depth regardless of the query. This is simple to implement: run a breadth-first search to depth d, collecting all nodes and their associated chunks along the way (see the sketch after this list). The downside is inefficiency. For a graph with average degree 10, a 3-hop traversal from a single seed touches up to 1,000 nodes. Most of those nodes are irrelevant to the specific query. You end up with a large candidate set that requires aggressive re-ranking to identify the useful chunks.
- Adaptive traversal dynamically decides at each hop whether to continue expanding, which edges to follow, and when to stop. This is the most sophisticated approach and yields the best retrieval quality, but it requires a mechanism for making hop-by-hop decisions.
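For concreteness, here is a minimal fixed-depth BFS over a graph like the one sketched earlier, collecting node and edge chunks along the way:

```python
from collections import deque

def fixed_depth_traverse(G, seeds, depth=3):
    """Breadth-first expansion to a fixed depth, collecting source chunks."""
    visited = set(seeds)
    chunks = []
    frontier = deque((s, 0) for s in seeds)
    while frontier:
        node, d = frontier.popleft()
        chunks += G.nodes[node].get("chunks", [])
        if d == depth:
            continue
        for _, nxt, data in G.out_edges(node, data=True):
            chunks += data.get("chunks", [])
            if nxt not in visited:
                visited.add(nxt)
                frontier.append((nxt, d + 1))
    return visited, chunks
```

Note that nothing here is query-aware: every edge gets expanded, which is exactly the inefficiency described above.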
There are two paradigms for adaptive traversal:
- Rule-based adaptive traversal: Define traversal policies based on edge types and query intent. If the query mentions "suppliers," follow "supplies_to" edges. If it mentions "recalls" or "enforcement," follow "enforcement_action" edges from the resulting nodes. The rules encode domain knowledge about which relationship chains are meaningful. This is deterministic, fast, and interpretable, but requires manual rule authoring for each query pattern.
- LLM-guided adaptive traversal: At each hop, present the current set of discovered nodes and edges to the LLM and ask it to decide which edges to follow next. The LLM acts as a traversal planner, using its understanding of the query to navigate the graph. This is more flexible than rules - it can handle novel query patterns without explicit programming - but adds an LLM call at each hop, increasing latency. For a 3-hop query, that is 3 additional LLM calls on top of the final generation call.
In practice, the best approach is often a hybrid: use rules for common, well-understood query patterns (which account for 80%+ of production traffic) and fall back to LLM-guided traversal for novel or complex queries. This gives you the latency of rule-based traversal for most requests and the flexibility of LLM-guided traversal when needed.
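One shape the hybrid might take. The rule table encodes known query patterns; `llm_pick_edges` is a hypothetical stand-in for a single LLM call that chooses among the edge types available at the current frontier:

```python
# Hand-authored rules: query keyword -> edge types worth following.
TRAVERSAL_RULES = {
    "supplier": ["supplies_to"],
    "enforcement": ["enforcement_action"],
    "recall": ["enforcement_action"],
    "product": ["depends_on"],
}

def plan_next_hop(query, available_edge_types):
    """Prefer deterministic rules; fall back to an LLM call for novel queries."""
    matched = [
        edge_type
        for keyword, edge_types in TRAVERSAL_RULES.items()
        if keyword in query.lower()
        for edge_type in edge_types
        if edge_type in available_edge_types
    ]
    if matched:
        return matched
    # Novel or complex pattern: spend one LLM call to plan this hop.
    return llm_pick_edges(query, sorted(available_edge_types))
```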
Pulling chunks along the path: evidence assembly
A knowledge graph traversal produces a subgraph - a set of nodes and edges connecting the seed entities to the answer entities. But the LLM does not reason over graph structures directly. It needs text. The final step of multi-hop retrieval is evidence assembly: collecting the text chunks associated with each node and edge in the traversal path and organizing them into a coherent context.
Each node in the graph should maintain a link to its source chunks - the specific text passages from which it was extracted. When the traversal visits a node, pull its source chunks. When it traverses an edge, pull the chunks that evidence that relationship. The result is a set of chunks ordered by their position in the traversal path: first the chunks about the seed entity, then the chunks about the first-hop entities, then the second-hop chunks, and so on.
This ordering is significant. By placing chunks in traversal order, you give the LLM a natural reasoning chain: "Here is the starting entity. Here are its suppliers. Here are the enforcement actions against those suppliers. Here are the products that depend on those suppliers." The LLM can follow this chain without needing to piece together the reasoning from randomly ordered chunks.
A practical concern is chunk deduplication. The same chunk may be associated with multiple nodes in the traversal path. If a single contract document mentions both the supplier relationship and product dependencies, it will appear twice. Deduplicate by chunk ID before assembling the context, but preserve the earliest position in the traversal order so the chunk appears where it first becomes relevant.
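In Python this deduplication is nearly free, since `dict.fromkeys` preserves first-seen order:

```python
def dedupe_chunks(chunk_ids):
    """Drop repeats, keeping each chunk at its earliest traversal position."""
    return list(dict.fromkeys(chunk_ids))
```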
Token budget management is also critical. A 3-hop traversal can easily produce 20-50 relevant chunks, more than most practical token budgets allow. Apply a path-aware scoring function that prioritizes chunks from the shortest path between seed and answer entities, de-prioritizes chunks from branches that did not lead to answer entities, and respects the token budget. For more on token-aware context assembly, see our post on context window assembly strategies.
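A simplified sketch of that selection step. Here path-awareness is reduced to two signals - whether a chunk sits on a path that reached an answer entity, and its hop distance - and `count_tokens` is a hypothetical tokenizer call:

```python
def assemble_context(chunks, budget_tokens):
    """Greedy, budget-aware selection over (chunk_id, text, hop, on_answer_path)."""
    # On-path chunks first, then shorter hop distances first.
    ranked = sorted(chunks, key=lambda c: (not c[3], c[2]))
    selected, used = [], 0
    for chunk_id, text, _, _ in ranked:
        cost = count_tokens(text)   # hypothetical tokenizer call
        if used + cost <= budget_tokens:
            selected.append(chunk_id)
            used += cost
    return selected
```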
Handling traversal failures gracefully
Multi-hop retrieval can fail at any hop. The seed entity might not exist in the graph. The expected edges might be missing. A hop might expand to thousands of nodes, blowing your latency budget. Robust systems need fallback strategies at each stage.
- Missing seed entities: If entity linking fails, fall back to standard vector search over the full chunk index. You lose the multi-hop capability but still return relevant results for the single-hop component of the query.
- Missing edges: If a hop finds no outgoing edges of the expected type, query the vector index for chunks that mention both the current entity and the next-hop concept. This is a hybrid fallback - using vector search to bridge gaps in the graph structure. The graph may be incomplete, but the underlying documents likely contain the relationship information even if extraction missed it.
- Fan-out explosion: If a hop produces more than a configurable maximum number of nodes (e.g., 100), apply a beam search strategy: score all expanded nodes by their relevance to the query and keep only the top-b, where b is your beam width. This prevents a single high-degree node (like a major corporation with thousands of relationships) from overwhelming the traversal.
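The beam pruning step itself is small, assuming a `relevance` function such as cosine similarity between a node's description embedding and the query embedding:

```python
def prune_frontier(nodes, query_embedding, beam_width=100):
    """Keep only the top-b expanded nodes by relevance to the query."""
    ranked = sorted(nodes, key=lambda n: relevance(n, query_embedding), reverse=True)
    return ranked[:beam_width]
```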
We benchmarked our compliance query system with and without multi-hop graph retrieval. On single-hop factoid questions, both approaches performed similarly - around 89% accuracy. On multi-hop questions that required connecting facts across documents, standard RAG dropped to 31% accuracy while graph-based retrieval maintained 74%. That 43-point gap is the difference between a useful tool and an expensive toy.
Measuring multi-hop retrieval quality
Standard retrieval metrics - recall@k, precision@k, MRR - measure single-hop retrieval quality. For multi-hop, you need additional metrics that capture whether the retrieval path is complete.
- Path completeness: For a query requiring N hops, did the retrieval system return evidence from all N hops? A 3-hop query answered with evidence from only 2 hops is partially complete but may still produce an incorrect or incomplete answer.
- Path correctness: Did the traversal follow the right edges? If the query asks about supply-chain risk but the traversal followed investment relationships instead, the path is incorrect even if it reached the right number of hops.
- Answer coverage: Does the assembled evidence contain all the facts needed to answer the query? This is the end-to-end metric that matters most. Evaluate it by having domain experts annotate a test set with the required facts for each query, then measuring what fraction of those facts appear in the retrieved evidence.
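Once a test set is annotated, these three metrics are cheap to compute. A sketch, assuming each annotated case records its expected hop count, expected edge types, and required fact chunks (field names are illustrative):

```python
def score_case(case, retrieved):
    """Score one annotated query against a retrieval result (assumed fields)."""
    path_completeness = retrieved["hops_reached"] / case["expected_hops"]
    path_correctness = (
        set(retrieved["edge_types_followed"]) == set(case["expected_edge_types"])
    )
    found = set(case["required_chunks"]) & set(retrieved["chunk_ids"])
    answer_coverage = len(found) / len(case["required_chunks"])
    return path_completeness, path_correctness, answer_coverage
```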
Building a high-quality evaluation set for multi-hop queries is labor-intensive but essential. The HotpotQA benchmark provides a useful reference for multi-hop question-answer pairs, though you will need domain-specific test sets for production evaluation.
Where TypeGraph fits in
TypeGraph's Graph RAG layer supports all three traversal strategies - single-hop, fixed-depth, and adaptive - with automatic fallback to vector search when graph coverage is incomplete. The traversal engine performs entity extraction from queries, links to the knowledge graph, walks edges according to configurable policies, and assembles evidence chunks in traversal order with path-aware scoring and token budget management. For teams that need multi-hop reasoning without building a custom graph traversal engine, TypeGraph provides the infrastructure so you can focus on the domain-specific traversal policies that make your retrieval system genuinely useful.