Knowledge Graph · Entity Resolution · Graph RAG · Data Quality

Entity Resolution in RAG Pipelines: How to Merge Duplicate Entities Across Unstructured Documents

Ryan Musser
Founder

The entity fragmentation problem nobody budgets for

You have ingested 50,000 documents into your RAG pipeline. Your knowledge graph extraction is running smoothly - entities are being pulled from text, relationships are being linked, and your retrieval quality looks promising in early tests. Then someone asks: "What is JPMorgan Chase's total exposure across all counterparty agreements?"

The answer comes back incomplete. Not because the documents are missing, but because your graph contains fourteen distinct nodes for what is obviously the same organization: "JPMorgan Chase," "JP Morgan Chase & Co.," "J.P. Morgan," "JPMorgan Chase Bank, N.A.," "JPMC," "Chase," and eight more variants sourced from legal filings, press releases, and internal memos. Each node has its own set of relationships. None of them are connected to each other.

This is entity fragmentation, and it is the single most common failure mode in knowledge-graph-augmented RAG systems. It silently degrades retrieval quality because the graph cannot traverse across what should be a single entity. Multi-hop queries - the entire reason you built a knowledge graph in the first place - break down when the hops dead-end at orphaned duplicates.

Why naive string matching falls apart immediately

The first instinct is to normalize entity names: lowercase everything, strip punctuation, maybe apply a Levenshtein distance threshold. This handles the easy cases - "JPMorgan" vs "jpmorgan" - but fails catastrophically on the hard ones. "Chase" and "JPMorgan Chase" have a large edit distance but refer to the same entity in many contexts. "Apple" the technology company and "Apple" the record label are string-identical but entirely different entities. "Deutsche Bank AG" and "DB" share almost no characters.
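A quick sketch makes the failure concrete. Here difflib's `ratio` stands in for Levenshtein- or Jaro-Winkler-style metrics; any pure surface-form score exhibits the same pathology:

```python
from difflib import SequenceMatcher

def surface_sim(a: str, b: str) -> float:
    """Surface-form similarity in [0, 1]; difflib's ratio is a
    stand-in for edit-distance style metrics."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

same_org = surface_sim("Chase", "JPMorgan Chase")          # same entity, middling score
name_collision = surface_sim("Apple", "Apple")             # different entities, perfect 1.0
abbreviation = surface_sim("Deutsche Bank AG", "DB")       # same entity, near-zero score
```

The metric is simultaneously too strict ("Deutsche Bank AG" vs "DB") and too permissive ("Apple" vs "Apple"), which is exactly why it cannot be the only signal.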

String similarity metrics like Levenshtein, Jaro-Winkler, and even token-overlap ratios operate purely on surface form. They have no understanding of what the entity is. For production entity resolution, you need to combine multiple signals: surface-form similarity, embedding-based semantic similarity, contextual co-occurrence, and structural features from the graph itself.

Embedding-based fuzzy matching: the foundation layer

A more robust approach starts with encoding entity mentions into dense vector representations. Rather than comparing raw strings, you embed each entity mention along with a window of surrounding context - typically 1-2 sentences on either side of the mention. This gives you a vector that captures not just the name but the role the entity plays in the text.

With these contextual embeddings, you can compute cosine similarity between candidate pairs. "JPMorgan Chase Bank, N.A." mentioned in a lending agreement and "JPMC" mentioned in a credit risk report will produce similar embeddings because their surrounding contexts discuss the same domain: counterparty risk, loan origination, regulatory capital.
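A minimal sketch of the idea, using a toy hashed character-trigram embedding as a stand-in for a real sentence encoder (in production you would use a transformer-based model; the mention strings and context snippets here are illustrative):

```python
import hashlib
import math

DIM = 256

def embed(text: str) -> list[float]:
    """Toy contextual embedding: hashed character trigrams over the
    mention plus its surrounding context. A stand-in for a real
    sentence encoder, not a production embedding."""
    vec = [0.0] * DIM
    t = text.lower()
    for i in range(len(t) - 2):
        h = int(hashlib.md5(t[i:i + 3].encode()).hexdigest(), 16) % DIM
        vec[h] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def cosine(u: list[float], v: list[float]) -> float:
    return sum(a * b for a, b in zip(u, v))

# Mentions embedded with surrounding context, not just the bare name:
a = embed("JPMC counterparty risk loan origination regulatory capital")
b = embed("JPMorgan Chase Bank, N.A. counterparty risk regulatory capital")
c = embed("Apple Records music label licensing agreement")
# Shared domain context pulls the true match together:
assert cosine(a, b) > cosine(a, c)
```

The shared domain vocabulary ("counterparty risk", "regulatory capital") drives the similarity even though "JPMC" and "JPMorgan Chase Bank, N.A." barely overlap as strings.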

The practical implementation involves two phases. First, a blocking phase where you use cheap heuristics - shared tokens, phonetic codes via Metaphone or Soundex, character n-gram overlap - to generate candidate pairs without comparing every entity to every other entity. For a graph with N entities, naive pairwise comparison is O(N²), which is untenable at scale. Blocking reduces this to O(N × k) where k is the average block size. Second, a scoring phase where you compute embedding similarity, string similarity, and attribute overlap for each candidate pair, combining them into a single confidence score.
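A minimal blocking pass might key entities on shared tokens; the scoring phase then runs only on the surviving candidate pairs (the entity names here are illustrative):

```python
from collections import defaultdict
from itertools import combinations

def token_blocks(entities: list[str]) -> set[tuple[int, int]]:
    """Blocking phase: candidate pairs are entities sharing at least
    one token, avoiding the full O(N^2) pairwise comparison."""
    blocks = defaultdict(list)
    for idx, name in enumerate(entities):
        for tok in name.lower().replace(",", " ").replace(".", " ").split():
            blocks[tok].append(idx)
    pairs = set()
    for members in blocks.values():
        for i, j in combinations(members, 2):
            pairs.add((i, j))
    return pairs

entities = ["JPMorgan Chase", "Chase Bank", "Deutsche Bank AG",
            "JPMorgan Chase Bank, N.A."]
pairs = token_blocks(entities)
# (0, 1) survives via the shared token "chase";
# (0, 2) is never compared: no token overlap.
```

Real systems typically union several blocking keys (tokens, Metaphone codes, character n-grams) so that a single miss does not drop a true pair.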

Research from the VLDB 2021 benchmark on entity matching demonstrates that hybrid approaches combining pre-trained language model embeddings with traditional similarity features consistently outperform either method alone, particularly on noisy real-world data where entity names are abbreviated, misspelled, or translated across languages.

Alias tracking and canonical entity management

Once you have identified that multiple mentions refer to the same real-world entity, you need a system for managing the merge. The cleanest pattern is the canonical entity with alias set model. One node is designated as the canonical representation. All other mentions become aliases that point to it. The canonical node accumulates the union of all relationships from its aliases.

Your alias registry should store each alias along with metadata: the source document it was extracted from, the confidence score of the resolution, and a timestamp. This provenance tracking is critical for debugging. When a user asks "why did the system merge these two entities?" you need to be able to answer with specifics, not just "the confidence was above threshold."

An important subtlety: aliases are not always symmetric in practice. "Chase" is a valid alias for "JPMorgan Chase" in a financial context, but in a consumer banking context, "Chase" might refer specifically to the retail banking division. Your alias registry should support context-scoped aliases where the mapping is conditional on document type, source, or domain.
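One way to model the canonical-entity-with-alias-set pattern, including provenance metadata and optional context scoping (field names and source documents are illustrative, not a fixed schema):

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional

@dataclass
class Alias:
    surface_form: str
    source_doc: str                 # provenance: where the mention came from
    confidence: float               # resolution confidence at merge time
    resolved_at: datetime
    context: Optional[str] = None   # e.g. "consumer_banking"; None = unscoped

@dataclass
class CanonicalEntity:
    name: str
    aliases: list = field(default_factory=list)
    relationships: set = field(default_factory=set)  # union of alias edges

    def add_alias(self, surface_form, source_doc, confidence, context=None):
        self.aliases.append(Alias(surface_form, source_doc, confidence,
                                  datetime.now(timezone.utc), context))

# Hypothetical sources, for illustration:
jpm = CanonicalEntity("JPMorgan Chase & Co.")
jpm.add_alias("JPMC", "credit_risk_report_2023.pdf", 0.91)
jpm.add_alias("Chase", "retail_terms.html", 0.87, context="consumer_banking")
```

With the `resolved_at` and `source_doc` fields populated, "why did the system merge these?" becomes a lookup rather than an archaeology project.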

Transitive merge strategies and the union-find approach

Entity match decisions are transitive. If entity A matches entity B, and entity B matches entity C, then A and C should also be merged - even if the direct A-to-C similarity is below your threshold. This transitivity is both a feature and a danger.

The standard data structure for managing transitive merges is union-find (disjoint set union). Each entity starts in its own set. When a pair is resolved as matching, their sets are merged. The canonical entity for each set is the root of the tree, typically chosen based on a priority rule: prefer the longest name variant, the most frequently occurring variant, or the variant from the most authoritative source.
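A sketch of union-find with a longest-name priority rule for choosing the canonical root (frequency or source authority would slot into the same comparison):

```python
class EntityUnionFind:
    """Disjoint-set union over entity IDs; the canonical representative
    of each set is the longest name variant."""

    def __init__(self, names: list[str]):
        self.names = names
        self.parent = list(range(len(names)))

    def find(self, i: int) -> int:
        while self.parent[i] != i:
            self.parent[i] = self.parent[self.parent[i]]  # path halving
            i = self.parent[i]
        return i

    def union(self, i: int, j: int) -> None:
        ri, rj = self.find(i), self.find(j)
        if ri == rj:
            return
        # Keep the longer name as the canonical root.
        if len(self.names[ri]) < len(self.names[rj]):
            ri, rj = rj, ri
        self.parent[rj] = ri

    def canonical(self, i: int) -> str:
        return self.names[self.find(i)]

uf = EntityUnionFind(["IBM", "International Business Machines", "I.B.M."])
uf.union(0, 1)   # "IBM" matches the long form
uf.union(0, 2)   # "I.B.M." transitively joins the same set
# uf.canonical(2) → "International Business Machines"
```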

The danger of transitive closure is merge drift. Consider a chain: "International Business Machines" → "IBM" → "IBM Cloud" → "IBM Cloud Pak for Data" → "Cloud Pak for Data" → "Cloud Pak." Each pairwise match seems reasonable, but the chain has drifted from a corporation to a specific software product. Left unchecked, transitive merges can collapse entire taxonomies into a single mega-entity.

The mitigation is a merge diameter constraint. Before completing a transitive merge, measure the similarity between the most distant members of the resulting set. If the minimum pairwise similarity within the set drops below a floor threshold - say 0.5 when your pairwise merge threshold is 0.85 - reject the merge and keep the sets separate. This is computationally more expensive than naive union-find but prevents the catastrophic chain-drift problem.
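One way to implement the diameter check, with hard-coded similarity scores standing in for a real pairwise scorer (the scores below are illustrative assumptions, chosen to mirror the IBM drift chain):

```python
from itertools import product

def guarded_merge(set_a: set, set_b: set, sim, floor: float = 0.5):
    """Merge two resolved sets only if no cross-pair falls below the
    floor threshold; otherwise keep them separate to prevent chain drift."""
    worst = min(sim(a, b) for a, b in product(set_a, set_b))
    if worst < floor:
        return None             # reject: merging would over-stretch the set
    return set_a | set_b

# Illustrative pairwise scores (assumption, not a real scorer):
SCORES = {
    frozenset({"International Business Machines", "IBM"}): 0.90,
    frozenset({"IBM", "IBM Cloud"}): 0.87,
    frozenset({"International Business Machines", "IBM Cloud"}): 0.30,
}

def sim(a: str, b: str) -> float:
    return 1.0 if a == b else SCORES.get(frozenset({a, b}), 0.30)

merged = guarded_merge({"International Business Machines", "IBM"},
                       {"IBM Cloud"}, sim)
# merged is None: "IBM" vs "IBM Cloud" clears the pairwise bar, but the
# long form vs "IBM Cloud" scores 0.30, below the 0.5 floor.
```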

Handling contradictory attributes across entity records

When two entity records merge, their attributes may conflict. Document A says the company's headquarters is in New York. Document B says it is in Dallas. Document C, which is older, says Houston. Which one wins?

The naive approach - "most recent document wins" - is wrong more often than teams expect. The most recent document might be a poorly OCR'd scan of a five-year-old report. Instead, implement a multi-signal attribute resolution strategy that considers:

  • Source authority ranking: Internal filings > official regulatory documents > press releases > third-party reports > social media. Assign each source category a numeric authority weight and use it to break ties.
  • Temporal validity: Some attributes are time-dependent. A company's headquarters can change. Rather than picking one value, store the attribute as a temporal series - headquarters was Houston from 2015-2020, then Dallas from 2020-present. Your retrieval layer can then return the value appropriate to the query's time context.
  • Contradiction flagging: When attributes conflict and neither source clearly dominates, do not silently pick one. Instead, flag the entity as having an unresolved contradiction. Surface this to the user so they can make an informed judgment. A retrieval system that says "sources disagree on this point" is far more trustworthy than one that confidently returns the wrong value. For a deeper dive into building systems that detect and surface these conflicts, see our post on contradiction detection in RAG knowledge bases.
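A compact sketch of authority-weighted resolution with contradiction flagging; the category weights are illustrative assumptions, and a full implementation would also carry the temporal-series logic described above:

```python
AUTHORITY = {  # illustrative source-authority weights (assumption)
    "regulatory_filing": 4,
    "press_release": 3,
    "third_party_report": 2,
    "social_media": 1,
}

def resolve_attribute(claims):
    """claims: list of (value, source_category, year).
    Returns (value, contradiction_flag): the value backed by the most
    authoritative sources, or a flagged contradiction when the top
    sources disagree."""
    best = max(AUTHORITY[cat] for _, cat, _ in claims)
    top_values = {val for val, cat, _ in claims if AUTHORITY[cat] == best}
    if len(top_values) > 1:
        return None, True   # sources disagree: surface it, don't guess
    return top_values.pop(), False

hq, contested = resolve_attribute([
    ("Houston", "third_party_report", 2015),
    ("Dallas", "regulatory_filing", 2021),
    ("New York", "press_release", 2023),
])
# hq == "Dallas": the regulatory filing outranks the newer press release
```

Note that "New York" loses despite being the most recent claim - exactly the case where recency-wins goes wrong.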

Offline batch resolution vs. online incremental resolution

Entity resolution can run in two modes, and most production systems need both.

Offline batch resolution processes your entire entity corpus at once. It is computationally expensive - even with blocking, resolving millions of entities can take hours - but it produces the highest-quality results because it has access to the full global context. Every entity can be compared against every other entity within its block. Transitive chains can be fully evaluated. Attribute conflicts can be resolved with the complete set of evidence.

Run batch resolution on initial data load and periodically thereafter - weekly or monthly depending on your corpus velocity. Use it to establish the baseline canonical entity graph.

Online incremental resolution processes new entities as documents are ingested. When a new document introduces an entity mention, the system must decide in near-real-time whether it matches an existing canonical entity or represents a genuinely new entity. This is a fundamentally harder problem because you are making local decisions without global context.

The practical pattern is to use an approximate nearest neighbor (ANN) index over your canonical entity embeddings. When a new mention arrives, query the index for the top-k most similar canonical entities. If the highest similarity exceeds your merge threshold, merge the new mention into that canonical entity. If not, create a new canonical entity. Periodically, run offline batch resolution to catch merges that the incremental process missed.
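A sketch of the decision logic, with brute-force nearest-neighbor search standing in for a real ANN index (e.g. HNSW or FAISS); the threshold and example vectors are illustrative:

```python
import math

MERGE_THRESHOLD = 0.85  # illustrative; tune against labeled pairs

class IncrementalResolver:
    """Online resolution: match each new mention against canonical
    entity embeddings, or mint a new canonical entity."""

    def __init__(self):
        self.canonicals: list[tuple[str, list[float]]] = []

    def resolve(self, name: str, vec: list[float]) -> str:
        best_i, best_sim = -1, -1.0
        for i, (_, cvec) in enumerate(self.canonicals):
            s = sum(a * b for a, b in zip(vec, cvec))  # cosine (unit vectors)
            if s > best_sim:
                best_i, best_sim = i, s
        if best_sim >= MERGE_THRESHOLD:
            return self.canonicals[best_i][0]   # merge into existing entity
        self.canonicals.append((name, vec))     # genuinely new entity
        return name

def unit(v: list[float]) -> list[float]:
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

r = IncrementalResolver()
r.resolve("JPMorgan Chase", unit([1.0, 0.1, 0.0]))
match = r.resolve("JPMC", unit([0.95, 0.12, 0.01]))  # near-identical vector
# match == "JPMorgan Chase": merged into the existing canonical entity
```

Swapping the linear scan for a real ANN index changes the lookup cost, not the decision logic, so the sketch carries over directly.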

This dual-mode architecture gives you the responsiveness of real-time processing and the accuracy of batch processing. The Microsoft GraphRAG paper explores how community-level entity summarization can further improve resolution quality by leveraging graph structure during the batch phase.

We were running entity extraction on 200,000 legal documents and ended up with over 40,000 duplicate entity nodes before we built proper resolution. After implementing embedding-based matching with transitive merge constraints, we collapsed that to 12,000 canonical entities - and our multi-hop retrieval accuracy jumped from 34% to 71% overnight.

Measuring entity resolution quality

Entity resolution is a classification problem - you are deciding whether pairs of mentions refer to the same entity. The standard metrics apply: precision (what fraction of merged pairs are correct), recall (what fraction of true matches were found), and F1. But there is a subtlety: in production, the cost of false merges (merging two different entities) is typically much higher than the cost of missed merges (failing to connect two mentions of the same entity).

A false merge contaminates the canonical entity's relationships. If "Apple Inc." gets merged with "Apple Records," every query about Apple's technology products will also return Beatles licensing agreements. The damage propagates through every downstream retrieval. A missed merge, by contrast, merely means some queries return incomplete results - which is bad, but recoverable.

For this reason, most production systems set their merge threshold to favor high precision over high recall, accepting some fragmentation in exchange for clean canonical entities. Typical thresholds are in the 0.85-0.92 range for the combined similarity score. You can read more about evaluating retrieval quality across your entire pipeline in our post on RAG retrieval evaluation metrics.
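The pairwise framing above reduces to a few lines - each mention pair is one classification, and precision, recall, and F1 fall out of set intersections (the example pairs are illustrative):

```python
def pairwise_metrics(predicted: set, gold: set):
    """Precision/recall/F1 over merge decisions, treating each
    mention pair (a frozenset of two names) as one classification."""
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

gold = {frozenset(p) for p in [("JPMC", "JPMorgan Chase"),
                               ("Chase", "JPMorgan Chase"),
                               ("Chase", "JPMC")]}
pred = {frozenset(p) for p in [("JPMC", "JPMorgan Chase"),
                               ("Apple Inc.", "Apple Records")]}  # one false merge
p, r, f1 = pairwise_metrics(pred, gold)
# p == 0.5: one of two predicted merges is correct; recall is 1/3
```

If false merges really cost more than missed merges, report the two error types separately rather than collapsing them into F1 - a single blended number hides exactly the asymmetry you care about.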

Where TypeGraph fits in

TypeGraph's knowledge graph layer includes built-in entity resolution that combines embedding-based matching, alias tracking with provenance, and configurable transitive merge constraints. Both batch and incremental resolution modes are supported, with automatic contradiction detection when merged entities carry conflicting attributes. The state-of-the-art research on entity matching with language models continues to advance rapidly, and TypeGraph's architecture is designed to incorporate these improvements as the field evolves - so your knowledge graph gets cleaner over time, not more fragmented.
