
Context Window Assembly Strategies: How to Pack the Most Useful Information Into Your LLM's Token Budget

Ryan Musser
Founder

Retrieval is only half the problem

Your retrieval pipeline returns a ranked list of relevant chunks. Now what? The default approach - concatenate the top-k chunks and prepend them to the prompt - is so common that most teams never question it. But the assembly step, where retrieved chunks are transformed into the actual text that fills the LLM's context window, has an outsized impact on answer quality. Two systems with identical retrieval can produce dramatically different answers based solely on how they assemble the context.

The context window is not a dump truck. It is a precision instrument with specific failure modes that respond to specific engineering strategies. Understanding those failure modes - and engineering around them - is the difference between a RAG system that works on demos and one that works in production.

Neighbor joining: re-attaching the context your chunking strategy removed

Chunking splits documents into retrievable units. Retrieval selects the most relevant units. But the act of chunking often severs information that was contextually connected. When a retrieved chunk is the second paragraph of a four-paragraph explanation, the first paragraph provides essential context that the chunk alone lacks - definitions, assumptions, or the problem statement that the retrieved paragraph is answering.

Neighbor joining (sometimes called context window expansion or chunk augmentation) addresses this by retrieving not just the matched chunk but also its immediate neighbors from the original document. If chunk 47 is retrieved with high relevance, also include chunks 46 and 48 in the context. The neighboring chunks provide the surrounding context that makes the matched chunk interpretable.

The implementation requires maintaining a chunk adjacency index - a mapping from each chunk ID to its predecessor and successor chunk IDs within the same document. When a chunk is retrieved, look up its neighbors and include them. If the same document contributes multiple retrieved chunks that are already adjacent (e.g., chunks 47 and 48 are both retrieved), merge them into a single contiguous passage rather than including the overlap twice.
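Below is a minimal sketch of that lookup-and-merge step, assuming each chunk carries a document ID and a sequential position within its document. The class and field names are illustrative, not from any particular library.

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    doc_id: str
    position: int   # sequential index within the source document
    text: str
    score: float = 0.0

def join_neighbors(retrieved, all_chunks, window=1):
    """Expand retrieved chunks with their neighbors, merging adjacent spans.

    `all_chunks` maps (doc_id, position) -> Chunk for the whole corpus;
    `window` is the number of neighbors to pull on each side.
    """
    # Collect the positions we want per document: each hit plus its neighbors.
    wanted = {}  # doc_id -> set of positions
    for chunk in retrieved:
        positions = wanted.setdefault(chunk.doc_id, set())
        for offset in range(-window, window + 1):
            if (chunk.doc_id, chunk.position + offset) in all_chunks:
                positions.add(chunk.position + offset)

    # Merge runs of adjacent positions into contiguous passages so that
    # overlapping expansions (e.g. chunks 47 and 48 both retrieved) are
    # included only once.
    passages = []
    for doc_id, positions in wanted.items():
        run = []
        for pos in sorted(positions):
            if run and pos != run[-1] + 1:
                passages.append(_merge(doc_id, run, all_chunks))
                run = []
            run.append(pos)
        if run:
            passages.append(_merge(doc_id, run, all_chunks))
    return passages

def _merge(doc_id, positions, all_chunks):
    texts = [all_chunks[(doc_id, p)].text for p in positions]
    return Chunk(doc_id=doc_id, position=positions[0], text="\n".join(texts))
```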

How many neighbors to include is a tunable parameter. One neighbor on each side (a window of 3 chunks) is the most common setting and provides a good balance between context and token efficiency. For domains with long, flowing arguments - legal briefs, research papers, policy documents - a window of 5 (two neighbors on each side) often works better. For highly structured content with short, self-contained sections - FAQs, API references, configuration docs - neighbor joining may not help at all, and you can save tokens by skipping it.

A more sophisticated variant is selective neighbor joining. Instead of always including neighbors, use a classifier or heuristic to decide whether the retrieved chunk is self-contained. If the chunk begins with a sentence fragment, a pronoun reference ("This approach..."), or a continuation marker ("Additionally..."), it likely needs its predecessor. If it ends mid-sentence or with an incomplete thought, it needs its successor. Only join neighbors when the chunk boundary is semantically incomplete.
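One way to approximate the "is this chunk self-contained?" check is a lightweight heuristic over the chunk's first and last characters. The marker lists below are illustrative examples, not an exhaustive classifier:

```python
# Openers that typically refer back to the previous chunk.
CONTINUATION_OPENERS = (
    "this ", "these ", "those ", "it ", "they ",
    "additionally", "furthermore", "however", "in addition",
)

def needs_predecessor(text: str) -> bool:
    """Heuristic: chunk likely depends on the previous chunk for context."""
    first = text.lstrip()
    # Starting lowercase usually means a sentence fragment carried over a boundary.
    if first and first[0].islower():
        return True
    return first.lower().startswith(CONTINUATION_OPENERS)

def needs_successor(text: str) -> bool:
    """Heuristic: chunk likely ends mid-thought."""
    last = text.rstrip()
    # No terminal punctuation suggests the sentence continues in the next chunk.
    return bool(last) and last[-1] not in ".!?\"')"
```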

Token-aware truncation: respecting the budget without destroying coherence

Every context window has a token budget. After accounting for the system prompt, the user's query, and the reserved space for the model's response, you might have 8,000-12,000 tokens for retrieved context in a typical 16k-token setup. If your retrieval returns 10 chunks of 800 tokens each, you cannot fit them all. You need to truncate.

Naive truncation takes the top-k chunks by relevance score until the budget is exhausted. If the 7th chunk would overflow the budget, it is excluded entirely. This wastes the remaining token capacity - you might have 600 unused tokens that could carry useful partial information.

Token-aware truncation is more efficient. It operates in two passes, sketched in code after the list:

  1. Pass 1 (greedy fill): Starting with the highest-ranked chunk, add chunks to the context as long as they fit within the budget. Track the remaining token count as you go.
  2. Pass 2 (gap fill): If there are remaining tokens after all full chunks have been placed, examine the next-ranked chunks that were excluded. If a chunk is too large, consider truncating it to its first N paragraphs or sentences to fit the remaining budget. Truncate at a sentence boundary to preserve coherence. A partial chunk with the first two paragraphs of a relevant section is often more useful than no chunk at all.
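A sketch of both passes, assuming the chunks are plain strings already sorted by relevance and `count_tokens` is backed by your model's real tokenizer. The `min_partial_tokens` cutoff and the sentence-splitting rule are illustrative choices:

```python
def assemble_within_budget(chunks, budget, count_tokens, min_partial_tokens=150):
    """Greedy fill, then gap fill with a sentence-boundary truncated partial chunk."""
    selected, skipped, remaining = [], [], budget

    # Pass 1 (greedy fill): take full chunks in relevance order while they fit.
    for chunk in chunks:
        size = count_tokens(chunk)
        if size <= remaining:
            selected.append(chunk)
            remaining -= size
        else:
            skipped.append(chunk)

    # Pass 2 (gap fill): if useful space remains, truncate the best skipped
    # chunk at a sentence boundary so a partial passage fills the gap.
    if remaining >= min_partial_tokens:
        for chunk in skipped:
            partial = truncate_to_sentences(chunk, remaining, count_tokens)
            if partial:
                selected.append(partial)
                break
    return selected

def truncate_to_sentences(text, budget, count_tokens):
    """Keep whole leading sentences until the budget would be exceeded."""
    sentences = text.replace("\n", " ").split(". ")
    kept, used = [], 0
    for sentence in sentences:
        size = count_tokens(sentence + ". ")
        if used + size > budget:
            break
        kept.append(sentence)
        used += size
    return ". ".join(kept) + "." if kept else None
```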

One critical implementation detail: always count tokens using the actual tokenizer for your target model. Token counts vary significantly between models - GPT-4's tokenizer and Claude's tokenizer produce different counts for the same text. If you estimate tokens using a generic method (like dividing character count by 4), you will either waste context capacity or overflow the window. Measuring the downstream impact of your truncation strategy requires tracking retrieval metrics end to end - our post on RAG retrieval evaluation covers the recall and precision metrics that matter most.
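For OpenAI models, the tiktoken library gives exact counts; for other providers, check whether an official tokenizer or token-counting endpoint exists rather than relying on the character-count estimate. The comparison below assumes tiktoken is installed:

```python
import tiktoken

text = "Neighbor joining re-attaches the context your chunking strategy removed."

# Exact count using the model's own tokenizer.
encoding = tiktoken.encoding_for_model("gpt-4")
exact = len(encoding.encode(text))

# Generic estimate: roughly 4 characters per token for English prose.
estimate = len(text) // 4

print(f"exact={exact}, estimate={estimate}")
# The two diverge further on code, non-English text, and unusual formatting,
# which is why budgets computed from the estimate can overflow the window.
```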

The "lost in the middle" problem: why ordering matters

In 2023, researchers at Stanford, UC Berkeley, and Samaya AI published a study showing that LLMs are significantly better at using information placed at the beginning and end of the context window than information placed in the middle. This finding, detailed in the "Lost in the Middle" paper, has direct implications for how you order retrieved chunks in your context.

The default ordering - ranked by relevance score, so the most relevant chunk is first - actually plays into this failure mode. The second and third most relevant chunks end up in the middle of the context, exactly where the LLM is least likely to attend to them. If those middle chunks contain critical supporting evidence, the model may generate an answer based only on the first and last chunks, missing key facts.

Several ordering strategies mitigate this effect:

  • Relevance-first-and-last (sandwich ordering): Place the most relevant chunk first, the second most relevant chunk last, and fill the middle with lower-ranked chunks. This ensures the two most important pieces of evidence are in the positions where the LLM attends most strongly. It is simple to implement and provides measurable improvement on multi-chunk reasoning tasks.
  • Document-order preservation: If multiple chunks come from the same document, preserve their original order rather than interleaving them by relevance score. A reader (human or LLM) understands a document better when its sections arrive in the order the author intended. Group chunks by document, order the groups by the highest relevance score within each group, and preserve intra-group ordering.
  • Chronological ordering: For time-sensitive queries, order chunks by date (oldest first or newest first, depending on the query intent). When the user asks "How has our return policy changed?", presenting the policies in chronological order gives the LLM a narrative arc that naturally supports accurate summarization.

In practice, document-order preservation within groups combined with sandwich ordering across groups tends to perform best. The most relevant document goes first, the second most relevant goes last, and each document's chunks maintain their natural reading order.
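A sketch of that combined strategy: group by document, rank groups by their best chunk, sandwich the groups, and keep each document's chunks in reading order. The dictionary keys are illustrative:

```python
from collections import defaultdict

def sandwich_order(chunks):
    """chunks: list of dicts with 'doc_id', 'position', and 'score' keys."""
    # Group chunks by source document and sort each group into reading order.
    groups = defaultdict(list)
    for chunk in chunks:
        groups[chunk["doc_id"]].append(chunk)
    for doc_chunks in groups.values():
        doc_chunks.sort(key=lambda c: c["position"])

    # Rank groups by the best relevance score they contain.
    ranked = sorted(groups.values(),
                    key=lambda g: max(c["score"] for c in g),
                    reverse=True)

    # Sandwich: best group first, second-best last, the rest in the middle.
    if len(ranked) < 2:
        ordered_groups = ranked
    else:
        first, second, *rest = ranked
        ordered_groups = [first, *rest, second]

    return [chunk for group in ordered_groups for chunk in group]
```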

Output formatting: XML tags vs. markdown headers for source attribution

How you format the retrieved chunks within the context window affects both the LLM's ability to distinguish between sources and its ability to cite them in responses. There are three dominant formatting approaches.

  • Plain concatenation: Chunks are joined with a simple delimiter (a newline, a horizontal rule, or a short label like "Source 1:"). This is the lowest-effort approach. It works for simple queries but fails when the LLM needs to attribute claims to specific sources, because the source boundaries are ambiguous.
  • Markdown headers: Each chunk is preceded by a markdown header containing the source metadata: ### Source: API Documentation v2.4 (March 2026, Section 3.2). This provides clear visual (and semantic) boundaries between sources. LLMs trained on markdown-heavy corpora tend to respect these boundaries and can reference them in their answers: "According to the API Documentation v2.4..."
  • XML tags: Each chunk is wrapped in XML elements: <source id="1" title="API Documentation v2.4" date="2026-03">...chunk text...</source>. XML formatting has two advantages over markdown. First, it provides machine-parseable structure - downstream processing can extract which sources the LLM referenced. Second, LLMs (particularly Claude and GPT-4) have been shown to handle XML-structured prompts with high fidelity, maintaining clear source separation even in long contexts.

Our recommendation: use XML tags for production systems where source attribution matters (internal knowledge bases, compliance applications, customer support) and markdown headers for lighter applications where readability in prompt debugging is more important than machine-parseability. Whichever format you choose, always include source metadata - document title, date, and section - not just a numeric ID. The LLM uses this metadata to assess source relevance and recency, improving answer quality independent of the formatting choice.
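A minimal formatter for the XML style is sketched below. The tag and attribute names mirror the example above; they are a convention we find workable, not a requirement of any model API:

```python
from xml.sax.saxutils import escape, quoteattr

def format_sources_xml(chunks):
    """chunks: list of dicts with 'title', 'date', 'section', and 'text' keys."""
    blocks = []
    for i, chunk in enumerate(chunks, start=1):
        blocks.append(
            f'<source id="{i}" '
            f"title={quoteattr(chunk['title'])} "
            f"date={quoteattr(chunk['date'])} "
            f"section={quoteattr(chunk['section'])}>\n"
            f"{escape(chunk['text'])}\n"
            f"</source>"
        )
    return "\n\n".join(blocks)

context = format_sources_xml([
    {"title": "API Documentation v2.4", "date": "2026-03",
     "section": "3.2", "text": "Rate limits apply per API key..."},
])
```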

The Anthropic prompt engineering documentation provides practical guidance on XML-structured prompts, including patterns for multi-source contexts that translate directly to RAG assembly.

Deduplication and redundancy management

Retrieval often returns overlapping or near-duplicate chunks. This happens when multiple versions of the same document exist in the index, when neighbor joining pulls in chunks that were also retrieved independently, or when the same information appears in multiple source documents. Including duplicates in the context wastes tokens and can bias the LLM toward the duplicated information (repetition increases the model's confidence in a claim regardless of its accuracy).

Implement a deduplication step before context assembly:

  • Exact deduplication: Remove chunks with identical text. Use a hash-based check for O(1) detection.
  • Near-duplicate detection: Remove chunks with cosine similarity above 0.95 (using their existing embeddings). Keep the version with the higher relevance score or the more recent source date. This catches cases where the same paragraph appears in slightly different documents - a quarterly report and a summary report, for example.
  • Subsumption detection: If chunk A is contained entirely within chunk B (e.g., because neighbor joining expanded a short chunk into a longer passage that fully contains it), remove chunk A and keep chunk B. This is particularly important when combining neighbor-joined chunks with independently retrieved chunks.

After deduplication, re-rank the remaining chunks. The removal of duplicates may have shifted the relevance distribution, and a chunk that was ranked 8th might now deserve a higher position if the chunks ranked above it were duplicates.
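A sketch of the three checks plus the final re-rank, assuming each chunk carries its text, its embedding as a numpy array, and a relevance score, and that the input list is already sorted by descending relevance so the higher-scored version of a duplicate survives. For brevity, subsumption is checked in one direction only:

```python
import hashlib
import numpy as np

def deduplicate(chunks, similarity_threshold=0.95):
    """chunks: list of dicts with 'text', 'embedding', and 'score' keys,
    sorted by descending relevance so earlier chunks win ties."""
    kept, seen_hashes = [], set()
    for chunk in chunks:
        # Exact deduplication via a hash of the normalized text.
        digest = hashlib.sha256(chunk["text"].strip().encode()).hexdigest()
        if digest in seen_hashes:
            continue

        duplicate = False
        for existing in kept:
            # Subsumption: drop the chunk if its text is fully contained
            # in a chunk we already kept (e.g. a neighbor-joined passage).
            if chunk["text"] in existing["text"]:
                duplicate = True
                break
            # Near-duplicate: cosine similarity of the existing embeddings.
            a, b = chunk["embedding"], existing["embedding"]
            cosine = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
            if cosine >= similarity_threshold:
                duplicate = True
                break

        if not duplicate:
            seen_hashes.add(digest)
            kept.append(chunk)

    # Re-rank survivors: positions shift once duplicates are gone.
    return sorted(kept, key=lambda c: c["score"], reverse=True)
```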

We ran a controlled experiment: same retrieval pipeline, same model, same queries - but we A/B tested naive concatenation vs. XML-tagged context with sandwich ordering and neighbor joining. Factual accuracy on our evaluation set went from 71% to 84%. Attribution accuracy - whether the model correctly cited the source of its claims - went from 23% to 67%. The assembly step was the single highest-leverage improvement we made all quarter.

Putting it all together: the assembly pipeline

A production context assembly pipeline runs these steps in order:

  1. Neighbor joining: Expand retrieved chunks with adjacent chunks from the same document. Merge overlapping expansions.
  2. Deduplication: Remove exact duplicates, near-duplicates, and subsumed chunks. Re-rank.
  3. Token-aware truncation: Fill the token budget using greedy fill and gap fill passes. Truncate partial chunks at sentence boundaries.
  4. Ordering: Apply sandwich ordering across document groups, preserve document order within groups.
  5. Formatting: Wrap each chunk in XML tags or markdown headers with full source metadata.
  6. Contradiction annotation: If the pipeline detects conflicting facts across chunks, annotate them in the context so the LLM can surface the disagreement. For details on building the detection layer, see our post on contradiction detection in RAG knowledge bases.

Each step is independently configurable and testable. You can measure the impact of neighbor joining by comparing retrieval evaluation metrics with and without it. You can measure the impact of ordering strategies by comparing answer accuracy across different orderings. This composable architecture lets you optimize each step without rebuilding the entire pipeline.
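That composability maps naturally onto a list of stage functions, each taking and returning the working chunk list, so stages can be toggled or reordered per pipeline. The stage names in the usage comment refer to the sketches earlier in this post and are illustrative:

```python
def assemble_context(retrieved_chunks, stages):
    """Run each assembly stage in order. Every stage is a plain function
    from a list of chunks to a list of chunks (the final formatting stage
    returns the assembled context string)."""
    result = retrieved_chunks
    for stage in stages:
        result = stage(result)
    return result

# Example configuration: skip neighbor joining for FAQ-style corpora by
# leaving that stage out of the list.
# pipeline = [neighbor_join_stage, deduplicate, truncate_stage,
#             sandwich_order, format_sources_xml]
# context = assemble_context(chunks, pipeline)
```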

Where TypeGraph fits in

TypeGraph's context assembly engine handles neighbor joining, token-aware truncation, ordering optimization, source formatting, and deduplication as configurable stages in the retrieval-to-generation pipeline. Each stage is independently tunable - you can set neighbor window sizes, choose formatting modes, configure ordering strategies, and define token budgets per-query or per-pipeline. The assembly engine integrates with TypeGraph's contradiction detection layer to annotate conflicting evidence before it reaches the LLM. The result is a context window engineered for maximum information density and model comprehension, not just a list of top-k chunks.
