
Procedural Memory: The Skills Layer AI Agents Are Just Starting to Build

Ryan Musser
Founder

The system that runs without you

You can ride a bike without thinking about it. You drive home from work and arrive without remembering the turns. You can touch-type a familiar password before your conscious mind has caught up. None of this lives in semantic or episodic memory. It is procedural memory, the system that stores skills, habits, and motor sequences, and it operates almost entirely below conscious awareness.

The biology

Procedural memory has its own neural circuit, distinct from declarative memory:

  • The basal ganglia (especially the striatum) handle habit formation and reward-based action selection. This is the system that learns "if X, then do Y" through repetition.
  • The cerebellum manages motor coordination and error correction.
  • The supplementary motor area plans complex sequences.

The Fitts and Posner three-stage model (1967) describes how a skill matures:

  1. Cognitive stage: deliberate, attention-demanding, slow. You think about every shifted gear when you first drive a stick shift.
  2. Associative stage: smoother, fewer errors. You no longer have to look at the pedals.
  3. Autonomous stage: automatic, minimal attention required. You drive while holding a conversation.

The clinical evidence for the procedural / declarative split is striking. Patient H.M., who had bilateral hippocampal lesions, could not form new declarative memories: he could not tell you what he had done that morning. But he could learn new motor skills (mirror tracing, for example) and improve at them day after day, while consistently denying any memory of having practiced. That dissociation is one of the cleanest demonstrations in cognitive neuroscience that procedural and declarative memory are separate systems.

The technology

Procedural memory is the least mature of the three long-term memory types in AI. The first dedicated frameworks appeared only in late 2025:

  • Mem^p (Fang et al., 2025) is the first dedicated procedural memory framework. It treats procedures as a first-class optimization object with Build / Retrieve / Update lifecycle operations. Mem^p distills agent trajectories into reusable step-by-step instructions and higher-level scripts. On TravelPlanner and ALFWorld benchmarks, agents using Mem^p achieve steadily higher success rates, and procedural memory transfers from stronger to weaker models, meaning GPT-4o's procedures can substantially improve weaker model performance.
  • LangMem implements procedural memory through prompt optimization: its create_prompt_optimizer analyzes successful and unsuccessful interactions and updates system prompts. This mirrors the cognitive-to-autonomous transition: the agent's "instructions" become more refined and less explicit over time.
  • MemOS introduces tool memory for agent planning, distilling when, where, and how to use various tools.
  • MACLA (arXiv:2512.18950, Dec 2025) uses Bayesian selection and contrastive refinement of hierarchical procedural memory, achieving 78.1% average accuracy with 2,800x faster memory construction than parameter-training baselines.
  • CodeMem argues that the optimal procedural memory representation is executable code, not natural language. That argument is gaining ground. If the procedure can be expressed as a function, executing the function is more reliable than describing it to an LLM.

Fine-tuned models serve as "compiled" procedural knowledge: what starts as explicit prompt text becomes embedded in the weights, faster and more automatic. RL policies (such as those Anthropic and others are publishing for tool use) implement procedural memory directly, learning reward-based state-to-action mappings that parallel the basal ganglia's role.
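The reward-based state-to-action mapping can be shown with tabular Q-learning, the simplest RL analogue of the basal ganglia's "if X, then do Y" habit learning. This is a toy sketch under obvious assumptions (one state, two actions, a fixed reward), not how tool-use policies are actually trained:

```python
import random
from collections import defaultdict

# Toy Q-learning: repetition plus reward turns a state->action lookup
# into a reliable "habit" (the greedy action).
ALPHA, EPSILON = 0.5, 0.1
ACTIONS = ["good", "bad"]
Q = defaultdict(float)  # (state, action) -> learned value

def choose(state: str) -> str:
    if random.random() < EPSILON:                     # occasional exploration
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(state, a)])  # exploit the habit

def reward(action: str) -> float:
    return 1.0 if action == "good" else 0.0           # "good" always pays off

random.seed(0)
for _ in range(200):                                  # repetition builds the habit
    a = choose("state_x")
    Q[("state_x", a)] += ALPHA * (reward(a) - Q[("state_x", a)])

# After training, the preferred action is effectively "compiled" into Q:
best = max(ACTIONS, key=lambda a: Q[("state_x", a)])
```

The parallel to Fitts and Posner is direct: early episodes are exploratory and error-prone (cognitive stage), and after enough repetitions the greedy lookup fires without deliberation (autonomous stage).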

Where the gap is

Procedural memory is the most under-built memory component in production agents today. Most agents still rely on hand-crafted prompt templates and tool descriptions. There is no standard architecture. The frameworks that exist (Mem^p, MACLA, ReMe) appeared only in late 2025 and are still finding their feet.

The deeper gap is the cognitive-to-autonomous transition. Humans can learn a skill so thoroughly that conscious instruction stops helping (try teaching an experienced driver to drive: they cannot articulate most of what they do). LLM agents do not really have that progression. Their "skills" are either explicit (a prompt template, a function call) or implicit (in the weights), with no smooth path between the two. The Fitts and Posner model has not been mapped well into AI yet.

Practical implication: if you find yourself rewriting prompts and tool descriptions over and over to handle "the agent forgot how to do X again," procedural memory is what you are missing. Mem^p, LangMem prompt optimization, and tool-memory libraries are the early production options worth experimenting with.


← Previous: Semantic Memory · Series anchor · Next: Memory Encoding →
