Conversation Replay for AI Agents: How to Debug What Your Agent Did and Why
Last month, a customer on our enterprise tier reported that their support agent had confidently told a user their subscription included a feature it did not. The agent cited internal documentation, quoted a pricing page, and even referenced a prior conversation. Every individual piece looked plausible. But the conclusion was wrong, and the customer wanted to know exactly how the agent arrived at it.
The engineering team pulled logs. They found the retrieval call, the memory reads, and the final generation. But the logs were scattered across services, timestamps were inconsistent, and there was no way to reconstruct the agent's decision-making sequence in order. They spent three days piecing together a timeline that should have taken three minutes.
This is the conversation replay problem. When an agent fails, you do not need a dashboard of aggregate metrics. You need a frame-by-frame replay of every decision the agent made, in the exact order it made them, with the exact data it had access to at each step.
Why traditional logging fails for agent debugging
Traditional application logging was designed for request-response systems. A user makes an HTTP request, the server processes it, and a response goes back. The causal chain is short and linear. You can usually debug failures by reading a single log stream filtered by request ID.
Agent systems break this model in fundamental ways. A single user query can trigger a cascade of internal operations: multiple memory reads to establish context, several retrieval calls against different knowledge bases, policy evaluations to determine what data the agent can access, tool invocations that may themselves trigger sub-queries, and finally one or more generation calls that synthesize all of it into a response. Each of these operations may be asynchronous, may fan out across services, and may depend on the results of prior operations in ways that are not captured by simple timestamps.
Structured logging helps, but it is not sufficient. Even with JSON-formatted log lines and correlation IDs, you are left with a bag of events that you must mentally reconstruct into a causal sequence. When an agent makes a wrong decision at step 7, you need to see exactly what it knew at step 6 - not what the system's state was at the same wall-clock time, but what data had actually flowed into the agent's context at that specific point in its reasoning chain.
Event-sourced conversation architecture
The solution is to treat every agent conversation as an event-sourced stream. Instead of logging after-the-fact summaries of what happened, you capture each discrete operation as an immutable event in a strictly ordered sequence. The event log becomes the single source of truth for what the agent did, and replay becomes a matter of reading that log from the beginning and stepping through it.
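To make this concrete, here is a minimal sketch of what an append-only conversation log might look like, written in Python and kept entirely in memory. The `ConversationEvent` and `EventLog` names are ours for illustration; a production system would persist events to a durable store.

```python
import time
import uuid
from dataclasses import dataclass, field
from typing import Any, Iterator

@dataclass(frozen=True)  # frozen: a recorded event is immutable
class ConversationEvent:
    conversation_id: str
    sequence: int            # strict per-conversation order, not wall-clock time
    event_type: str          # e.g. "memory.read", "query.execute"
    payload: dict[str, Any]  # full inputs and outputs, captured by value
    event_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    recorded_at: float = field(default_factory=time.time)

class EventLog:
    """Append-only store; replay is reading a conversation from sequence 0 upward."""

    def __init__(self) -> None:
        self._events: dict[str, list[ConversationEvent]] = {}

    def append(self, conversation_id: str, event_type: str,
               payload: dict[str, Any]) -> ConversationEvent:
        events = self._events.setdefault(conversation_id, [])
        event = ConversationEvent(conversation_id, len(events), event_type, payload)
        events.append(event)
        return event

    def replay(self, conversation_id: str) -> Iterator[ConversationEvent]:
        yield from self._events.get(conversation_id, [])
```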
This is not a novel pattern. Event sourcing has been a cornerstone of financial systems, audit-critical applications, and distributed systems for decades. The event sourcing pattern as described by Martin Fowler captures the core idea: store every state change as an event, and derive current state by replaying events in order.
What makes agent systems different is the diversity of event types. In a typical CRUD application, you have a handful of event types: created, updated, deleted. In an agent system, the event taxonomy is richer and more nuanced.
Defining the event taxonomy for agent operations
A well-designed replay system requires a comprehensive event taxonomy that covers every operation an agent can perform. Based on production experience across dozens of agent deployments, we have converged on the following core event types (a code sketch of the taxonomy follows the list):
- memory.read - Fired when the agent reads from its memory store. The event payload includes the query used to read memory, any filters applied (tenant, user, time range), the results returned, and the latency of the operation. This is critical for debugging because it tells you exactly what context the agent had when it started reasoning.
- memory.write - Fired when the agent writes a new memory or updates an existing one. The payload includes the full content being written, the memory category (episodic, semantic, procedural), any metadata attached, and the entity or relationship being stored. Memory writes during a conversation are often the source of downstream errors: if the agent writes an incorrect summary of a user's statement, every subsequent memory read will retrieve that incorrect summary.
- memory.delete - Fired when a memory is removed, whether by explicit agent action, retention policy enforcement, or user-initiated right-to-be-forgotten requests. Deletions are especially important for debugging because they explain why context the agent previously had access to is suddenly missing.
- query.execute - Fired when the agent executes a retrieval query against a knowledge base or vector store. The payload includes the raw query text, the embedding vector (or a reference to it), any metadata filters, the top-k parameter, the results returned with their similarity scores, and the total candidate count. This event is the key to understanding retrieval quality issues: was the right document in the results but ranked too low, or was it missing entirely?
- tool.call - Fired when the agent invokes an external tool (API call, database query, calculation, code execution). The payload includes the tool name, input parameters, output result, latency, and any errors. Tool calls are often where agent reasoning goes off the rails, especially when tools return unexpected formats or error responses that the agent misinterprets.
- agent.run - A meta-event that represents an entire reasoning cycle. It wraps a sequence of the above events, capturing the initial prompt, the system instructions, the assembled context window, and the final output. This is the top-level event you start with when replaying a conversation, and you drill down into its child events to understand the details.
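To make the taxonomy concrete, here is one way it might be encoded, extending the sketch above. The payload field names are illustrative, not a standard; the point is that each event type declares up front what it must capture for a replay to be faithful.

```python
from enum import Enum

class AgentEventType(str, Enum):
    MEMORY_READ = "memory.read"
    MEMORY_WRITE = "memory.write"
    MEMORY_DELETE = "memory.delete"
    QUERY_EXECUTE = "query.execute"
    TOOL_CALL = "tool.call"
    AGENT_RUN = "agent.run"

# Minimum payload fields per event type, mirroring the list above.
REQUIRED_PAYLOAD_KEYS: dict[AgentEventType, set[str]] = {
    AgentEventType.MEMORY_READ: {"query", "filters", "results", "latency_ms"},
    AgentEventType.MEMORY_WRITE: {"content", "category", "metadata"},
    AgentEventType.MEMORY_DELETE: {"memory_id", "reason"},
    AgentEventType.QUERY_EXECUTE: {"query_text", "filters", "top_k",
                                   "results", "candidate_count"},
    AgentEventType.TOOL_CALL: {"tool_name", "input", "output", "latency_ms"},
    AgentEventType.AGENT_RUN: {"prompt", "system_instructions",
                               "context_window", "output"},
}

def validate_payload(event_type: AgentEventType, payload: dict) -> None:
    """Reject events missing fields needed for faithful replay."""
    missing = REQUIRED_PAYLOAD_KEYS[event_type] - payload.keys()
    if missing:
        raise ValueError(f"{event_type.value} missing fields: {sorted(missing)}")
```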
Payload-level replay: capturing what the agent saw, not what you think it saw
The most common mistake in building replay systems is logging references instead of values. For example, logging "retrieved document ID doc_12345" instead of logging the full text of what was retrieved. When you go to replay the conversation a week later, document doc_12345 may have been updated, re-indexed, or deleted. The replay shows you the current state of the document, not the state the agent actually saw.
Payload-level replay means capturing the full content of every input and output at every step. When the agent reads a memory, you store the complete memory content as it was returned. When a retrieval query returns results, you store the full text of every chunk with its score. When a tool returns a response, you store the entire response body.
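Here is the difference in miniature, as a sketch against the `EventLog` from earlier; the retrieval result shape is hypothetical.

```python
def record_retrieval(log: EventLog, conversation_id: str,
                     query_text: str, results: list[dict]) -> None:
    # Anti-pattern (reference logging): storing only IDs means replay must
    # re-fetch documents that may since have been updated or deleted.
    #   payload = {"result_ids": [r["id"] for r in results]}

    # Payload-level capture: freeze the exact content the agent saw.
    payload = {
        "query_text": query_text,
        "results": [
            {"id": r["id"], "text": r["text"], "score": r["score"]}
            for r in results
        ],
    }
    log.append(conversation_id, "query.execute", payload)
```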
This is expensive in terms of storage, but the alternative is a replay system you cannot trust. There are strategies to manage the cost: compress payloads, use tiered storage with hot/warm/cold tiers based on conversation age, and set retention policies that automatically expire replay data after a configurable period. But do not sacrifice payload completeness for storage savings. A partial replay is worse than no replay at all, because it gives you false confidence that you understand what happened.
We had a critical bug where our agent was hallucinating policy details for enterprise customers. Traditional logs showed the retrieval call and the generation, but we could not figure out why the agent was combining two different policies into one answer. Once we built payload-level replay, we could step through the exact sequence: the retrieval returned two overlapping policy documents, and the agent merged them because the context window placed them adjacent to each other. We fixed it by adding deduplication at the retrieval layer. Without replay, we would still be guessing.
Building the replay viewer: from raw events to developer experience
A replay system is only as good as its interface. Raw event logs in JSON format are technically complete but practically useless for debugging under time pressure. The replay viewer needs to present the conversation as a step-by-step timeline that a developer can scrub through like a video player.
The key design elements we have found effective are (a command-line sketch of the first two follows this list):
- A vertical timeline showing each event in sequence, with icons indicating the event type (memory read, retrieval, tool call, generation).
- Expandable payload panels that show the full input and output of each event.
- A diff view for memory writes that highlights what changed.
- A context window visualizer that shows exactly what text was assembled and sent to the language model at each generation step.
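A full viewer is a frontend project in its own right, but the core loop is small. Here is a command-line sketch of the timeline and expandable panels, built on the earlier `EventLog`; the text tags and truncated values are stand-ins for icons and panels.

```python
TYPE_TAGS = {
    "memory.read": "[MEM-R]", "memory.write": "[MEM-W]",
    "memory.delete": "[MEM-D]", "query.execute": "[QUERY]",
    "tool.call": "[TOOL]", "agent.run": "[RUN]",
}

def print_timeline(log: EventLog, conversation_id: str, expand: bool = False) -> None:
    """Step through a conversation in recorded order, like scrubbing a video."""
    for event in log.replay(conversation_id):
        tag = TYPE_TAGS.get(event.event_type, "[?]")
        print(f"{event.sequence:3d} {tag} {event.event_type}")
        if expand:  # the CLI analogue of an expandable payload panel
            for key, value in event.payload.items():
                print(f"        {key}: {str(value)[:100]}")
```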
The context window visualizer deserves special attention. Most agent bugs are context assembly bugs: the right information was retrieved but placed in the wrong order, or irrelevant context diluted the signal, or a critical piece of context was truncated to fit the token limit. Being able to see the exact token sequence sent to the model - with color-coded sections showing which context came from memory, which from retrieval, and which from the system prompt - makes these bugs immediately visible.
Deterministic vs. non-deterministic replay
There is an important distinction between replaying what happened and re-executing what happened. Replaying means reading the recorded event stream and displaying it. Re-executing means feeding the same inputs back through the system and checking whether you get the same outputs.
Deterministic replay - re-execution with output comparison - is the gold standard for debugging, but it is hard to achieve with LLM-based agents. Language model outputs are inherently stochastic (even at temperature 0, different API versions can produce different outputs), tool calls may have side effects, and external APIs may return different results at different times.
The practical approach is to use recorded replay for investigation and deterministic replay only for the components you control. You can deterministically replay memory reads, retrieval queries, and policy evaluations against a snapshot of the data store. For LLM generation steps, you compare the recorded output against a fresh generation and flag any significant divergence. This hybrid approach gives you the debugging power of replay without requiring full determinism.
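A sketch of the comparison step for generation events, using a crude lexical similarity from the Python standard library. Here `generate_fn` is a hypothetical callable wrapping your LLM provider; a production system might compare embeddings instead.

```python
from difflib import SequenceMatcher

def verify_generation_step(event: ConversationEvent, generate_fn,
                           threshold: float = 0.9) -> dict:
    """Re-run a recorded generation step and flag significant divergence."""
    recorded = event.payload["output"]
    fresh = generate_fn(event.payload["context_window"])
    # SequenceMatcher gives a rough 0..1 lexical similarity score.
    similarity = SequenceMatcher(None, recorded, fresh).ratio()
    return {"diverged": similarity < threshold, "similarity": round(similarity, 3)}
```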
OpenTelemetry's tracing concepts provide a useful mental model here: each event in your replay log is analogous to a span, and the full conversation is a trace. The difference is that your spans carry full payload data, not just timing and metadata.
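If you already run OpenTelemetry, the mapping can be made literal. Here is a sketch using the Python SDK; the attribute names are our own choice, not a semantic convention.

```python
import json

from opentelemetry import trace

tracer = trace.get_tracer("agent.replay")

def export_as_span(event: ConversationEvent) -> None:
    # One replay event becomes one span; the conversation is the trace.
    with tracer.start_as_current_span(event.event_type) as span:
        span.set_attribute("conversation.id", event.conversation_id)
        span.set_attribute("event.sequence", event.sequence)
        # Unlike a typical span, we attach the full payload, not just timing.
        span.set_attribute("event.payload", json.dumps(event.payload))
```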
Correlating replay with production metrics
Replay becomes even more powerful when you can correlate individual conversation replays with aggregate production metrics. If your observability dashboard shows a spike in retrieval latency at 2:14 PM, you should be able to click into that time window and see the individual conversation replays that were affected. Conversely, if a specific conversation replay shows an unusually slow retrieval step, you should be able to see whether that was an isolated incident or part of a broader pattern.
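A sketch of the drill-down direction against the in-memory `EventLog`; a real store would index events by time rather than scanning every conversation.

```python
from datetime import datetime, timedelta

def replays_in_window(log: EventLog, center: datetime, radius: timedelta,
                      event_type: str | None = None) -> set[str]:
    """Find conversations with events recorded near a metrics spike."""
    start = (center - radius).timestamp()
    end = (center + radius).timestamp()
    hits: set[str] = set()
    for conversation_id, events in log._events.items():
        for e in events:
            in_window = start <= e.recorded_at <= end
            type_ok = event_type is None or e.event_type == event_type
            if in_window and type_ok:
                hits.add(conversation_id)
    return hits
```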
This bidirectional navigation between aggregate metrics and individual replays is what separates a production-grade debugging system from a collection of logs. It is the same principle that makes distributed tracing tools like Jaeger and Datadog APM valuable: the ability to move fluidly between the macro view (what is happening across the system) and the micro view (what happened in this specific request).
Retention, storage, and privacy considerations
Payload-level replay data is rich and potentially sensitive. Every memory read, every retrieval result, and every tool response may contain user data, proprietary content, or personally identifiable information. Your replay system needs the same access controls, encryption, and retention policies that you apply to the agent's memory store itself.
In practice, this means:
- Encrypt replay data at rest and in transit.
- Apply tenant-level access controls so that replay data for one customer's conversations is not visible to another customer's administrators.
- Implement configurable retention periods (30 days is a reasonable default for debugging, longer for compliance-critical workloads).
- Provide a mechanism to purge replay data for specific users in response to data deletion requests (sketched below).
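A sketch of that purge mechanism against the earlier `EventLog`. It assumes payloads carry the `filters` field from our illustrative taxonomy sketch; real schemas will differ, and a production purge would also need to reach backups and cold storage tiers.

```python
def purge_user_replays(log: EventLog, user_id: str) -> int:
    """Handle a data deletion request by dropping every replay event
    whose payload references the user. Returns the number removed."""
    removed = 0
    for conversation_id, events in log._events.items():
        kept = [e for e in events
                if e.payload.get("filters", {}).get("user") != user_id]
        removed += len(events) - len(kept)
        log._events[conversation_id] = kept
    return removed
```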
The storage cost of payload-level replay is non-trivial. A single agent conversation with 5 memory reads, 3 retrieval queries, and 2 tool calls can generate 50-100 KB of replay data. At scale, this adds up. Compression (gzip typically achieves 5-10x on JSON payloads), tiered storage, and selective recording (e.g., only record full payloads for conversations that trigger error conditions or user complaints) are all effective strategies for managing costs without sacrificing debugging capability.
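Compression in particular is nearly free to add. A minimal sketch using only the standard library:

```python
import gzip
import json

def compress_payload(payload: dict) -> bytes:
    """JSON replay payloads are repetitive; gzip often achieves 5-10x."""
    return gzip.compress(json.dumps(payload).encode("utf-8"))

def decompress_payload(blob: bytes) -> dict:
    return json.loads(gzip.decompress(blob).decode("utf-8"))
```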
How TypeGraph approaches conversation replay
TypeGraph's architecture treats every agent operation as a first-class event in an ordered stream. Memory reads, writes, and deletes; retrieval queries and results; policy evaluations; and tool invocations are all captured with full payloads as part of the normal operation pipeline, not as an afterthought bolted onto logging. The replay viewer provides a step-by-step timeline of any conversation, with expandable payloads and context window visualization, so that when an agent makes a mistake you can understand exactly why within minutes rather than days.
For teams building agent systems that need to operate reliably in production, conversation replay is not a nice-to-have. It is the difference between debugging by intuition and debugging by evidence. The investment in building it pays for itself the first time a critical agent failure needs to be explained to a customer, a compliance officer, or your own engineering team.