
Incremental Re-Indexing for RAG: How to Keep a Million-Document Corpus Current Without Re-Embedding Everything

Ryan Musser
Founder

The full re-index problem

Your knowledge base started with 10,000 documents. Indexing them took 20 minutes and cost $5 in embedding API calls. Fast forward six months: you have 500,000 documents, nightly updates touch maybe 2% of them, but your indexing job processes all 500,000 every night. It takes 8 hours, costs $200, and if it fails halfway through, you're serving stale data until the next successful run.

This is the full re-index trap, and nearly every RAG team falls into it. The initial indexing script doesn't distinguish between new, changed, and unchanged documents - it just processes everything. That's fine at 10K documents. It's untenable at scale.

Content-hash change detection

The core idea behind incremental re-indexing is simple: only re-process documents that have actually changed. The implementation requires a content-addressable approach - compute a hash of each document's content, store it alongside the indexed chunks, and compare on subsequent runs.
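A content hash can be sketched in a few lines. This is a minimal illustration, assuming documents arrive as plain-text strings; the whitespace normalization step is an illustrative choice, not a requirement:

```python
import hashlib

def content_hash(text: str) -> str:
    """Return a stable SHA-256 hash of a document's content.

    Normalizing whitespace first (an illustrative choice) avoids
    re-indexing documents that changed only in trivial formatting.
    """
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()
```

Store the resulting hex digest alongside the document's chunks; on the next run, a differing digest marks the document as changed.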

On each indexing run:

  1. Scan your document source to get the current set of documents and their content hashes.
  2. Compare against stored hashes from the previous run.
  3. Categorize each document as new (no previous hash), changed (hash differs), unchanged (hash matches), or deleted (previous hash exists but document is gone).
  4. Only process new and changed documents. Skip unchanged ones entirely.

The hash comparison itself is fast - even for a million documents, comparing SHA-256 hashes is a sub-second operation. The expensive part (chunking, embedding, and storing) only happens for the documents that actually changed. For a typical corpus where 1-5% of documents change daily, this reduces indexing cost and time by 95%+.
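The categorization step above can be expressed as a simple dictionary diff. A sketch, assuming both the current scan and the stored baseline are maps from document ID to content hash:

```python
def diff_documents(current: dict[str, str], previous: dict[str, str]):
    """Categorize documents by comparing current vs stored content hashes.

    Both arguments map document ID -> content hash. Only `new` and
    `changed` documents need chunking/embedding; `deleted` IDs need
    their chunks pruned from the vector store.
    """
    new = [doc_id for doc_id in current if doc_id not in previous]
    changed = [doc_id for doc_id, h in current.items()
               if doc_id in previous and previous[doc_id] != h]
    unchanged = [doc_id for doc_id, h in current.items()
                 if previous.get(doc_id) == h]
    deleted = [doc_id for doc_id in previous if doc_id not in current]
    return new, changed, unchanged, deleted
```

Even at a million entries, this diff is an in-memory operation that completes in well under a second.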

Upsert vs. replace: handling changed documents

When a document changes, you have two options for updating the index:

  • Replace mode deletes all existing chunks for the changed document, then re-chunks and re-indexes it from scratch. This is simpler and guarantees consistency - you'll never have stale chunks lingering from a previous version. The downside is that chunk IDs change, which can break any external references to specific chunks.
  • Upsert mode re-chunks the document, computes chunk-level hashes, and only updates chunks that actually changed. If a minor edit changed one paragraph in a 50-page document, only the chunks containing that paragraph get re-embedded. This is more efficient but more complex - you need chunk-level change tracking and careful handling of chunks that were added, removed, or shifted in position.

For most teams, replace mode is the right starting point. The per-document cost of re-embedding a handful of chunks is low, and the implementation simplicity pays for itself in reduced bugs. Move to upsert mode only if your changed documents are very large and changes are very localized. The Pinecone upsert documentation and similar guides from other vector stores provide good starting points for the mechanics.
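Replace mode can be sketched as delete-then-insert. The `store`, `chunk`, and `embed` interfaces here are hypothetical stand-ins for your vector store client, chunker, and embedding call:

```python
def reindex_replace(doc_id, text, store, chunk, embed):
    """Replace-mode update: delete all existing chunks for the document,
    then re-chunk, re-embed, and insert fresh chunks.

    Hypothetical interfaces: store.delete_by_doc(doc_id) and
    store.insert(records); chunk(text) -> list[str];
    embed(chunks) -> list of vectors.
    """
    store.delete_by_doc(doc_id)  # guarantees no stale chunks survive
    chunks = chunk(text)
    vectors = embed(chunks)
    records = [
        {"id": f"{doc_id}#{i}", "vector": v, "text": c, "doc_id": doc_id}
        for i, (c, v) in enumerate(zip(chunks, vectors))
    ]
    store.insert(records)
    return len(records)
```

Deriving chunk IDs from the document ID plus position makes the delete step a single filtered operation, at the cost of IDs changing whenever the document is re-chunked.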

Pruning deleted documents

The most commonly overlooked aspect of incremental indexing is deletion handling. When a document is removed from your source, its chunks linger in the vector store forever unless you explicitly delete them. Over time, this creates a growing layer of "ghost chunks" - content that no longer exists in your knowledge base but still appears in search results.

The fix is a reconciliation step at the end of each indexing run: compare the set of document IDs in your vector store against the current set of documents in your source. Any document ID present in the store but absent from the source gets its chunks deleted. This is the data quality equivalent of garbage collection - simple in concept, critical in practice.
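The reconciliation step reduces to a set difference. A sketch, where `delete_chunks` is a hypothetical stand-in for your vector store's delete-by-document operation:

```python
def prune_deleted(store_doc_ids: set, source_doc_ids: set, delete_chunks):
    """Garbage-collect 'ghost chunks': any document ID present in the
    vector store but absent from the source gets its chunks deleted.

    `delete_chunks(doc_id)` is a hypothetical store operation that
    removes every chunk belonging to the given document.
    """
    ghosts = store_doc_ids - source_doc_ids
    for doc_id in ghosts:
        delete_chunks(doc_id)
    return ghosts
```

Running this at the end of every indexing run keeps the store and source in lockstep; the returned set also makes a useful metric to alert on if deletions spike unexpectedly.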

Cursor-based sync for large corpora

For very large corpora (millions of documents), even the change detection scan can be expensive if it requires reading every document's content to compute hashes. A cursor-based approach uses your document source's native change tracking - last-modified timestamps, change data capture streams, or webhook notifications - to identify candidates for re-processing without scanning the entire corpus.

Store a cursor (typically a timestamp or sequence number) after each successful sync run. On the next run, only query for documents modified since the cursor. This reduces the initial scan from "read all documents" to "read documents changed since last sync," which can be orders of magnitude faster if your document source supports efficient range queries on modification time.
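The cursor loop can be sketched as follows. `source.changed_since(cursor)` is a hypothetical API that yields `(doc_id, modified_at, content)` tuples for documents modified after the cursor:

```python
def sync_since_cursor(source, cursor, process):
    """Cursor-based sync: fetch only documents modified after `cursor`,
    process them, then advance the cursor.

    The new cursor is the latest modification time seen, so the next
    run picks up exactly where this one left off. Persist the returned
    cursor only after the run succeeds, so a failed run is retried
    from the old cursor rather than silently skipped.
    """
    new_cursor = cursor
    for doc_id, modified_at, content in source.changed_since(cursor):
        process(doc_id, content)
        new_cursor = max(new_cursor, modified_at)
    return new_cursor
```

Content hashes remain useful even with a cursor: a touched-but-unchanged document (same hash, newer timestamp) can still be skipped before the expensive embedding step.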

Cost comparison: full vs. incremental

For a 500,000-document corpus with 2% daily churn using OpenAI's text-embedding-3-large:

Full re-index: ~500K documents × ~10 chunks/doc × ~500 tokens/chunk = 2.5B tokens/day ≈ $200/day in embedding costs.

Incremental: ~10K changed documents × ~10 chunks/doc × ~500 tokens/chunk = 50M tokens/day ≈ $4/day.

That's a 50x cost reduction, plus the indexing job completes in minutes instead of hours. The numbers scale linearly: at 5M documents with the same 2% churn rate, full re-indexing costs ~$2,000/day while incremental costs ~$40.
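The token arithmetic behind these figures is easy to verify (dollar costs then follow from whatever per-token rate your embedding provider charges):

```python
def daily_embedding_tokens(docs: int, chunks_per_doc: int = 10,
                           tokens_per_chunk: int = 500) -> int:
    """Tokens embedded per day, using the article's corpus assumptions."""
    return docs * chunks_per_doc * tokens_per_chunk

full = daily_embedding_tokens(500_000)   # full re-index: 2.5B tokens/day
incr = daily_embedding_tokens(10_000)    # 2% churn: 50M tokens/day
print(full // incr)                      # 50x reduction
```

The reduction factor is just the inverse of the churn rate: 2% daily churn means incremental indexing does 1/50th of the embedding work.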

Our indexing costs dropped from $6,200/month to $124/month after switching to incremental re-indexing. The engineering effort was about two days of work - probably the best ROI of any optimization we've done.

Implementation path

Start by adding content hashes to your indexing pipeline and storing them in a metadata table. On the next full run, you'll have a baseline to compare against. From there, add the change detection logic, skip unchanged documents, and add deletion reconciliation. You can build this yourself, or use a platform like TypeGraph that provides content-hash deduplication, incremental sync with cursor tracking, and automatic pruning of deleted documents out of the box - turning a multi-day indexing job into a minutes-long incremental sync.
