Agentic Retrieval: Why Your AI Agent Shouldn't Use the Same Search Query the User Typed
The gap between what users ask and what retrievers need
A user types: "Compare our Q3 revenue guidance with what analysts predicted, and flag any risks mentioned in the last board meeting." A traditional RAG pipeline takes that entire string, embeds it, and runs a single vector search. The results are predictably mediocre - a few documents vaguely related to revenue, nothing about analyst predictions, and the board meeting notes buried under irrelevant matches.
The problem isn't the retriever. It's that we're asking a retrieval system optimized for single-topic similarity to handle a compound, multi-faceted question. The user's natural language query is optimized for human communication, not for machine retrieval. Bridging this gap is what agentic retrieval is about.
Query decomposition: breaking compound questions apart
The first step in agentic retrieval is recognizing that most real-world questions contain multiple implicit sub-queries. The example above decomposes into at least three:
- What was our Q3 revenue guidance?
- What did analysts predict for our Q3 revenue?
- What risks were discussed in the most recent board meeting?
Each sub-query targets a different document set, requires different metadata filters, and may benefit from different retrieval weights. Research on least-to-most prompting and chain-of-thought decomposition shows that breaking complex questions into simpler components dramatically improves downstream accuracy.
The decomposition itself can be handled by an LLM - you prompt it with the original question and ask it to output a structured list of sub-queries. The key insight is that this decomposition step costs a few hundred tokens and a couple hundred milliseconds, but it can improve retrieval recall by 30-50% on compound queries.
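As a rough sketch of what that decomposition step can look like, the snippet below prompts a model for a JSON array of standalone sub-queries and falls back to the original question if the output is malformed. The `call_llm` callable and the prompt wording are placeholders for whatever model client and instructions you already use, not a specific library API.

```python
import json
from typing import Callable

DECOMPOSE_PROMPT = """\
Break the user's question into the minimal set of standalone sub-queries,
each answerable by a single retrieval call. Return a JSON array of strings.

Question: {question}
"""

def decompose(question: str, call_llm: Callable[[str], str]) -> list[str]:
    """call_llm is any function that sends a prompt to your model and returns its text."""
    raw = call_llm(DECOMPOSE_PROMPT.format(question=question))
    try:
        sub_queries = json.loads(raw)
    except json.JSONDecodeError:
        return [question]  # model didn't return JSON: fall back to the original question
    return [q for q in sub_queries if isinstance(q, str) and q.strip()] or [question]
```

For the compound example above, you would expect this to return three strings close to the bulleted sub-queries listed earlier, each of which can then be routed and retrieved independently.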
Multi-signal routing: picking the right retrieval method per sub-query
Not every sub-query benefits from the same retrieval approach. "What was our Q3 revenue guidance?" is best served by a hybrid search combining semantic similarity with keyword matching on "Q3" and "revenue guidance." The board meeting query benefits from metadata filtering on document type and date. Analyst predictions might require searching an external data source entirely.
Agentic retrieval treats retrieval as a toolkit, not a single pipeline. The agent decides, per sub-query, which combination of retrieval signals to activate: vector search, keyword search, knowledge graph traversal, structured metadata filters, or even external API calls. This is fundamentally different from the static "embed → search → stuff" pattern that dominates most RAG implementations.
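One way to make the routing decision concrete is to represent it as a small per-sub-query plan. The `RetrievalPlan` structure, its field names, and the hand-written rules below are illustrative assumptions; in practice the agent itself can emit this plan as structured output against whatever signals and metadata your index actually exposes.

```python
from dataclasses import dataclass, field

@dataclass
class RetrievalPlan:
    """One plan per sub-query: which signals to run and with what filters."""
    sub_query: str
    use_vector: bool = True
    use_keyword: bool = False
    use_graph: bool = False
    filters: dict = field(default_factory=dict)  # e.g. {"doc_type": "board_minutes"}

def plan_for(sub_query: str) -> RetrievalPlan:
    # Illustrative rule-based routing; an LLM-driven agent would produce the
    # same kind of plan from the sub-query text instead of hard-coded rules.
    q = sub_query.lower()
    if "board meeting" in q:
        return RetrievalPlan(sub_query, use_keyword=True,
                             filters={"doc_type": "board_minutes", "sort": "recency"})
    if "analyst" in q:
        return RetrievalPlan(sub_query, use_keyword=True,
                             filters={"source": "external_research"})
    return RetrievalPlan(sub_query, use_keyword=True)  # default: hybrid search
```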
Single-shot vs. iterative vs. tool-use retrieval
There are three distinct architectures for agentic retrieval, each with concrete tradeoffs:
- Single-shot retrieval decomposes the query, executes all sub-queries in parallel, and merges results. Latency is low (one round trip), but the agent can't adapt based on intermediate results. If the first retrieval reveals that the board meeting was actually in a different quarter, there's no chance to refine.
- Iterative retrieval executes sub-queries sequentially, allowing each step to inform the next. The agent reviews results from sub-query 1, decides whether to refine sub-query 2, and adapts its strategy. Latency is higher (multiple round trips), but accuracy improves significantly for questions where context from early results changes what you need to search for next.
- Tool-use retrieval exposes retrieval as a callable tool (or set of tools) that the agent invokes as needed during its reasoning process. This is the most flexible pattern - the agent might search, read a result, decide it needs more context, search again with different parameters, and repeat until it has enough information. Frameworks like Vercel AI SDK and LangGraph make this pattern increasingly accessible.
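A stripped-down version of that tool-use loop, independent of any particular framework, looks like the following. Both `call_llm` and `search` are stand-ins for your own model client and retrieval endpoint, and the prompt format is an assumption for illustration.

```python
import json
from typing import Callable

AGENT_PROMPT = """\
You may call the tool search(query) to retrieve documents. Respond with JSON:
{{"action": "search", "query": "..."}} to retrieve more, or
{{"action": "answer", "text": "..."}} when you have enough context.

Question: {question}
Context so far:
{context}
"""

def tool_use_retrieval(question: str,
                       call_llm: Callable[[str], str],
                       search: Callable[[str], list[str]],
                       max_steps: int = 5) -> str:
    """Minimal tool-use loop: search, read, decide whether to search again."""
    context: list[str] = []
    for _ in range(max_steps):
        raw = call_llm(AGENT_PROMPT.format(question=question, context="\n".join(context)))
        try:
            step = json.loads(raw)
        except json.JSONDecodeError:
            return raw  # model answered in free text; treat it as the final answer
        if step["action"] == "answer":
            return step["text"]
        context.extend(search(step["query"]))  # agent chose to retrieve again
    # Step budget exhausted: answer with whatever context was gathered.
    joined = "\n".join(context)
    return call_llm(f"Answer using this context:\n{joined}\n\nQuestion: {question}")
```

The `max_steps` cap is what keeps the "5+ steps for deeply complex queries" behavior bounded; without it, a confused agent can loop indefinitely.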
The latency-quality tradeoff in practice
In our benchmarks across customer workloads, we see a clear pattern: single-shot retrieval adds ~200ms of overhead for query decomposition but handles 70% of queries well. Iterative retrieval adds 500-1500ms but improves answer quality by 15-25% on complex, multi-part questions. Tool-use retrieval is the most variable - sometimes it completes in one step (matching single-shot latency), sometimes it takes 5+ steps for deeply complex queries.
The practical approach is to classify queries by complexity and route them to the appropriate strategy. Simple factual questions go through single-shot. Multi-part analytical questions use iterative. Open-ended research questions use tool-use. This classification itself can be done by a lightweight model or even a rule-based system that counts the number of entities and question marks in the input.
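The rule-based variant of that classifier can be almost trivially small. The heuristics and thresholds below are illustrative assumptions, not tuned against any benchmark; capitalized mid-sentence tokens stand in for a real entity count.

```python
import re

def classify_query(question: str) -> str:
    """Crude heuristic router; a lightweight model can replace this later."""
    question_marks = question.count("?")
    words = question.split()
    entities = sum(1 for w in words[1:] if w[:1].isupper())  # rough named-entity proxy
    connectives = len(re.findall(r"\b(and|compare|versus|vs\.?)\b", question.lower()))
    score = question_marks + entities + connectives
    if score >= 6:
        return "tool_use"      # open-ended or deeply compound: let the agent iterate
    if score >= 3:
        return "iterative"     # multi-part analytical question
    return "single_shot"       # simple factual lookup
```

On the compound example from the introduction this scores a 3 and routes to the iterative strategy, while "What was our Q3 revenue guidance?" stays on the single-shot path.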
After switching from static retrieval to agentic query decomposition, our answer accuracy on compound questions went from 54% to 81%. The latency increase was negligible for our use case - users were already waiting 3-5 seconds for a thoughtful answer.
Implementation considerations
If you're building agentic retrieval into your pipeline, start with query decomposition alone - it's the highest-impact change with the lowest implementation cost. Add multi-signal routing when you notice that different sub-queries consistently benefit from different retrieval methods. Move to iterative or tool-use patterns only when your queries genuinely require multi-step reasoning.
At TypeGraph, our retrieval API supports all three patterns through composable query weights - vector, keyword, graph, and memory - that agents can combine and sequence as needed. The MCP server interface makes retrieval a tool that any agent framework can invoke, turning static pipelines into dynamic, agent-driven search.