
Working Memory and Context Windows: Why Your Agent Forgets Mid-Task

Ryan Musser
Founder

The seat of conscious thought

Working memory is the bottleneck where everything you are thinking about right now sits. The phone number you are about to dial. The argument you are mid-sentence on. The five lines of code you are mentally tracing. It is small, it is fast, and it is the closest thing the brain has to a CPU register file. When people say "the model lost track of what we were doing," they usually mean working memory failed.

The biology

The dominant model is Baddeley's multicomponent model, originally proposed in 1974 and updated in 2000. It splits working memory into four cooperating subsystems:

  • The central executive (housed in dorsolateral prefrontal cortex) directs attention and decides what the other components work on.
  • The phonological loop holds verbal information through inner-speech rehearsal for about two seconds. This is why repeating a phone number to yourself keeps it alive.
  • The visuospatial sketchpad holds spatial and visual information.
  • The episodic buffer (added in 2000) ties everything together, integrating bits from working memory and long-term memory into coherent multi-modal "scenes."

The capacity is famously tiny. Miller's 1956 paper put it at 7 plus or minus 2 chunks. Cowan refined that estimate down to about 4 chunks. Sustained activity in prefrontal cortex and gamma oscillations (30 to 100 Hz) are what keep representations "online" while the central executive shuffles them around. The phonological loop and visuospatial sketchpad are independent, which is why you can rehearse a phone number while still navigating a parking lot.

The important thing about working memory is not just that it is small. It is that the contents are actively maintained. If the maintenance stops, the contents are gone in seconds. This is fundamentally different from long-term storage.

The technology

The LLM context window is the working memory analog. It has scaled stupendously, from 1,024 tokens in GPT-2 (2019) to 128K in GPT-4 Turbo, 2M in Gemini 1.5 Pro, and a claimed 10M in Llama 4 (2025). But quantity does not equal quality. The well-known "lost in the middle" effect (Liu et al., 2023) shows that information at the very beginning and very end of a long prompt is recalled with 85 to 95% accuracy, while material in the middle drops to 76 to 82%. This is the same primacy-recency effect cognitive psychologists have measured for decades.
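
If you want to see the effect yourself, a minimal probe is easy to write. The sketch below assumes you supply your own call_model function for whatever LLM client you use; the filler text, needle, and helper names are illustrative, not from the paper.

```python
# Probe "lost in the middle": bury a known fact at different relative depths of a
# long prompt and check whether the model can still retrieve it.
# `call_model` is a hypothetical stand-in for your LLM client.

def build_prompt(filler_sentences: list[str], needle: str, depth: float) -> str:
    """Insert `needle` at a relative position (0.0 = start, 1.0 = end)."""
    idx = int(depth * len(filler_sentences))
    body = filler_sentences[:idx] + [needle] + filler_sentences[idx:]
    return "\n".join(body) + "\n\nQuestion: What is the secret code word?"

def recall_at_depth(call_model, filler, needle, answer, depths):
    results = {}
    for depth in depths:
        prompt = build_prompt(filler, needle, depth)
        reply = call_model(prompt)                  # your LLM call goes here
        results[depth] = answer.lower() in reply.lower()
    return results

filler = [f"Background sentence number {i}." for i in range(2000)]
needle = "The secret code word is 'heliotrope'."
# Expect better recall at depths near 0.0 and 1.0 than near 0.5.
# print(recall_at_depth(my_llm, filler, needle, "heliotrope", [0.0, 0.25, 0.5, 0.75, 1.0]))
```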

Three modern engineering tricks parallel different parts of Baddeley's model:

  • KV caches store the key and value tensors computed during autoregressive inference so they do not have to be recomputed at every decoding step. They are functionally analogous to the substrate of working memory: the medium that holds the active state. Recent research like Entropy-Guided KV Caching (2025) allocates cache budget based on attention entropy, a loose parallel to the way attentional priority determines what stays actively maintained.
  • Self-attention serves as the central executive, computing relevance scores between all the things currently held in context (a bare-bones sketch follows this list). A 2023 paper from Kozachkov et al. (PNAS) showed that a biophysical neuron-astrocyte network can, in principle, implement transformer self-attention, offering a plausible biological account of how attention of this kind could be realized.
  • Chain-of-thought reasoning functions like the phonological loop. The model "verbalizes" intermediate steps in plain text, holding information through explicit token generation rather than subvocal rehearsal.
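
To make the central-executive analogy concrete, here is a bare-bones sketch of scaled dot-product attention over a handful of in-context items. It is the textbook formula in plain NumPy, not any particular model's implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product attention: every item in context scores its
    relevance to every other item, then reads out a weighted mixture."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # pairwise relevance
    weights = softmax(scores, axis=-1)        # what gets "attended to"
    return weights @ V, weights

rng = np.random.default_rng(0)
d = 16
X = rng.normal(size=(5, d))                   # five items "held in mind"
out, w = self_attention(X, *(rng.normal(size=(d, d)) for _ in range(3)))
print(w.round(2))                             # row i: how item i allocates attention
```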

The most ambitious working-memory implementation in production today is Letta (formerly MemGPT). Letta maintains three memory tiers: core memory blocks (always in context, around 2K characters, loosely analogous to Cowan's four chunks), recall memory (searchable conversation history), and archival memory (a vector database). When the context fills to roughly 70% of capacity, the agent autonomously summarizes and pages out less critical content. Sleep-time consolidation runs asynchronously, with reported accuracy gains of 18% and a claimed 2.5x cost reduction per query.
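
The paging behavior is easier to picture with a toy sketch. The snippet below is not Letta's API; it only illustrates the idea of a small always-in-context core plus eviction past a threshold, with summarize standing in for an LLM summarization call and the token budget chosen arbitrarily.

```python
from collections import deque

CONTEXT_BUDGET = 8_000          # tokens the model can hold (illustrative)
PAGE_OUT_THRESHOLD = 0.70       # start evicting at ~70% full, as described above

class TieredMemory:
    def __init__(self, summarize):
        self.core = {}                    # small, always-in-context blocks
        self.context = deque()            # (text, tokens) pairs, oldest first
        self.archive = []                 # paged-out summaries (stand-in for a vector DB)
        self.summarize = summarize        # hypothetical LLM summarization call

    def _used(self):
        return sum(t for _, t in self.context)

    def add(self, text, tokens):
        self.context.append((text, tokens))
        while self._used() > PAGE_OUT_THRESHOLD * CONTEXT_BUDGET:
            old_text, _ = self.context.popleft()            # evict oldest turn
            self.archive.append(self.summarize(old_text))   # keep a compressed trace

mem = TieredMemory(summarize=lambda s: s[:80] + "...")      # trivial stand-in summarizer
for turn in range(50):
    mem.add(f"Turn {turn}: some long exchange ...", tokens=400)
print(len(mem.context), "turns in context,", len(mem.archive), "paged out")
```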

Google's Infini-attention (Munkhdalai et al., 2024) splits computation into local attention (the current context, like the visuospatial sketchpad) and global linear attention (a compressive memory of the entire past, like the episodic buffer), with the authors reporting a 114x reduction in memory footprint compared with standard attention.
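
A heavily simplified sketch of the compressive-memory half is below, following the update and retrieval rules described in the paper (an ELU + 1 feature map, an associative matrix plus a normalization term). The dimensions and data are toy, and the local-attention half and the learned blending gate are omitted.

```python
import numpy as np

def sigma(x):
    """ELU + 1 feature map used for linear attention."""
    return np.where(x > 0, x + 1.0, np.exp(x))

class CompressiveMemory:
    """Running summary of everything already processed (toy dimensions)."""
    def __init__(self, d_k, d_v):
        self.M = np.zeros((d_k, d_v))   # associative memory matrix
        self.z = np.zeros(d_k)          # normalization term

    def update(self, K, V):
        """Fold a finished segment's keys/values into the memory."""
        self.M += sigma(K).T @ V
        self.z += sigma(K).sum(axis=0)

    def retrieve(self, Q):
        """Read from the compressed past for the current segment's queries."""
        return (sigma(Q) @ self.M) / (sigma(Q) @ self.z)[:, None]

rng = np.random.default_rng(0)
mem = CompressiveMemory(d_k=8, d_v=8)
for _ in range(4):                                  # stream of past segments
    K, V = rng.normal(size=(32, 8)), rng.normal(size=(32, 8))
    mem.update(K, V)
Q = rng.normal(size=(16, 8))                        # current segment's queries
A_mem = mem.retrieve(Q)                             # the paper blends this with local attention
print(A_mem.shape)                                  # (16, 8)
```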

Where the gap is

Context windows give us functional working memory but lack the active manipulation that makes the human version so flexible. Baddeley's central executive does not just store, it operates on the contents. It decides which subsystem holds what. It rehearses, refreshes, and discards. Most LLMs do not have a real architectural separation between a "phonological" verbal store and a "visuospatial" structural store.

Multimodal models like GPT-4o or Gemini are starting to look more like Baddeley's model in spirit, but the integration is implicit, not architectural. There is no clean separation between the two streams that would let the model rehearse one while computing on the other in parallel.

Practical implication for agent builders: do not assume that more tokens means more usable working memory. If the middle of your prompt is being silently ignored, no amount of additional context will help. Either chunk and route smaller pieces (Letta-style tiered memory), summarize aggressively, or use reranking to pull the most relevant material to the edges of the window (a sketch of that last option follows).
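
One way to act on that last suggestion, sketched with a deliberately crude keyword-overlap scorer standing in for a real reranker: rank the retrieved chunks, then interleave them so the strongest material lands at the start and end of the assembled context rather than the middle.

```python
def order_for_edges(chunks, relevance, query):
    """Place the highest-scoring chunks at the beginning and end of the
    context, pushing the weakest material toward the middle."""
    ranked = sorted(chunks, key=lambda c: relevance(query, c), reverse=True)
    front, back = [], []
    for i, chunk in enumerate(ranked):
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]     # strongest at both edges, weakest in the middle

# Toy relevance: keyword overlap; swap in a real reranker in practice.
def keyword_overlap(query, chunk):
    return len(set(query.lower().split()) & set(chunk.lower().split()))

chunks = ["retry logic for the payments API", "office lunch menu",
          "payments API rate limits", "holiday schedule", "API error codes"]
print(order_for_edges(chunks, keyword_overlap, "payments API retry"))
```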

← Previous: Sensory Memory · Series anchor · Next: Attention Gating →
