Ask an AI agent to remember a preference you mentioned three hours ago, and you'll watch it fail in interesting ways. Not because it's stupid, but because it's been designed to forget.
The instinct from providers has been to throw tokens at the problem. Context windows have ballooned from 4K to 128K to millions of tokens. But as we've covered before, there's a ceiling: self-attention scales quadratically, KV cache bandwidth creates memory bottlenecks, and costs explode faster than capability. Bigger context isn't the same as better memory. A comprehensive survey from researchers at Rutgers and elsewhere argues the field needs to stop treating memory as a side effect of context length and start treating it as a first-class architectural primitive.
The paper makes a compelling case: the simplistic long-term vs. short-term memory binary doesn't map to what agents actually need. They propose a more useful taxonomy built around three distinct types of memory, each with its own formation, evolution, and retrieval lifecycle.
Factual memory stores declarative facts. The user prefers dark mode. The codebase uses TypeScript. The API key expires in March. Relatively straightforward to implement and retrieve.
Experiential memory captures procedural and case-based learning. When the agent tried approach X and it failed, that's experiential memory. When it learned that a particular user responds better to concise explanations, that's experiential memory too. This is harder to encode and retrieve because the "right" prior experience depends heavily on context.
Working memory is task-specific scratch space: the current goal, intermediate reasoning steps, temporary variables. Fast and ephemeral, not persistent.
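The contrast between the three types is easier to see in code than in prose. Here's a minimal sketch, with illustrative class names and a deliberately toy similarity metric (real systems would use embeddings); none of this is from the survey itself:

```python
from dataclasses import dataclass, field
import time

@dataclass
class FactualMemory:
    """Declarative facts: slow to change, retrieved by key."""
    facts: dict = field(default_factory=dict)

    def remember(self, key, value):
        self.facts[key] = value  # overwrite: the latest fact wins

    def recall(self, key):
        return self.facts.get(key)

@dataclass
class ExperientialMemory:
    """Case-based episodes: retrieved by similarity to the current situation."""
    episodes: list = field(default_factory=list)

    def record(self, situation, action, outcome):
        self.episodes.append({"situation": situation, "action": action,
                              "outcome": outcome, "t": time.time()})

    def most_similar(self, situation):
        # Toy similarity: count shared words with the stored situation.
        words = set(situation.split())
        return max(self.episodes, default=None,
                   key=lambda e: len(words & set(e["situation"].split())))

@dataclass
class WorkingMemory:
    """Ephemeral scratch space, cleared when the task ends."""
    scratch: dict = field(default_factory=dict)

    def clear(self):
        self.scratch = {}
```

Note how each class has a different write pattern (overwrite vs. append vs. discard) and a different read pattern (exact key vs. similarity vs. direct access) — the same divergence the taxonomy describes.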
The insight worth sitting with: these aren't variations on a single "memory" concept. They're fundamentally different systems with different formation patterns, different update frequencies, and different retrieval strategies. Treating them identically (which is what you get when you just dump everything into a context window) means optimizing for none of them.
Even Infinite Context Wouldn't Solve This
The Agentic RAG survey documents why fixed-pipeline retrieval breaks down for agents doing multistep reasoning. Traditional RAG assumes you know what to retrieve before you start. But agents discover what they need as they work. They iterate, backtrack, and dynamically adjust their retrieval strategy based on what they find.
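The structural difference from fixed-pipeline RAG fits in a few lines: the next query is a function of what's been found so far, not decided up front. A sketch, where `search` and `decide_next` are hypothetical callables standing in for a retriever and an LLM-driven planner:

```python
def agentic_retrieve(question, search, decide_next, max_steps=5):
    """Iterative retrieval: each step's query depends on findings so far.

    search(query) -> list of documents.
    decide_next(question, findings) -> next query, or None when the agent
    judges it has enough evidence to answer.
    """
    findings = []
    query = question
    for _ in range(max_steps):
        findings.extend(search(query))
        query = decide_next(question, findings)
        if query is None:  # agent decided it has enough
            break
    return findings
```

A fixed pipeline is the degenerate case where `decide_next` always returns None after one step; everything agents gain comes from letting that function inspect intermediate results.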
There's also the "lost in the middle" problem. Models struggle to use information buried in the middle of long contexts. Just having the data in the window doesn't mean the model will attend to it effectively.
And then there's cost.
A production memory layer from Mem0, a Y Combinator-backed startup, reports 90% token reduction while maintaining accuracy, along with 91% faster response times compared to full-context approaches.
That's not a marginal improvement. It's the difference between viable and unviable at production scale.
Two Camps, Different Priorities
The solutions split between research systems that prioritize cognitive fidelity and production systems that prioritize latency and cost.
On the research side, EM-LLM takes inspiration from how human memory actually works. It uses Bayesian surprise to identify event boundaries (moments of unexpectedness that mark natural cognitive transitions). The system shows strong correlation with human-annotated event boundaries from podcast studies. It works as a plug-and-play addition to any transformer, requires no fine-tuning, and can handle contexts up to 10 million tokens while outperforming full-context models.
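The core mechanic — segment a token stream wherever surprisal spikes — can be sketched directly. This is a simplified illustration in the spirit of EM-LLM, not the paper's exact formulation: surprisal is -log p(token), and the threshold rule (running mean plus a multiple of the standard deviation) is a common choice I'm assuming here:

```python
import math

def event_boundaries(token_logprobs, window=32, gamma=1.0):
    """Mark an event boundary wherever surprisal (-log p) exceeds
    mean + gamma * std over the previous `window` tokens."""
    surprises = [-lp for lp in token_logprobs]
    boundaries = []
    for i in range(1, len(surprises)):
        prev = surprises[max(0, i - window):i]
        mean = sum(prev) / len(prev)
        var = sum((s - mean) ** 2 for s in prev) / len(prev)
        if surprises[i] > mean + gamma * math.sqrt(var):
            boundaries.append(i)
    return boundaries
```

A flat stretch of predictable tokens produces no boundaries; one sharply unexpected token produces exactly one, which is what makes the resulting segments behave like discrete episodes.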
A-Mem applies Zettelkasten-style linking to create self-organizing knowledge networks. Memory notes auto-generate contextual descriptions and dynamic links that evolve as new information arrives. The results are striking: 85-93% token reduction (1,200 vs. 16,900 tokens per operation) while actually improving accuracy. A 35% F1 improvement on conversational benchmarks. 2x better performance on multi-hop reasoning. Retrieval stays efficient even at over a million memory entries.
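The self-organizing part is the interesting bit: links aren't authored, they emerge as notes arrive. A toy sketch of the idea — A-Mem uses LLM-generated contextual descriptions and embedding similarity, whereas this stand-in links on keyword overlap purely for illustration:

```python
class ZettelMemory:
    """Each new note links bidirectionally to existing notes it overlaps
    with, so associative structure accumulates as the store grows."""

    def __init__(self, min_shared=2):
        self.notes = []  # each: {"text", "keywords", "links"}
        self.min_shared = min_shared

    def add(self, text):
        keywords = {w.lower().strip(".,") for w in text.split() if len(w) > 3}
        note = {"text": text, "keywords": keywords, "links": []}
        for i, other in enumerate(self.notes):
            if len(keywords & other["keywords"]) >= self.min_shared:
                note["links"].append(i)                  # new -> old
                other["links"].append(len(self.notes))   # old -> new
        self.notes.append(note)
        return note

    def neighborhood(self, idx):
        """Retrieve a note plus everything it links to (one hop)."""
        note = self.notes[idx]
        return [note["text"]] + [self.notes[j]["text"] for j in note["links"]]
```

Retrieval then pulls a small neighborhood instead of a whole history, which is where the order-of-magnitude token reductions come from.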
Production-focused systems like Mem0 organize memory into three tiers: user-level (long-term preferences), session-level (conversation context), and agent state (current interaction). Their benchmarks claim 26% higher accuracy than OpenAI's memory implementation.
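The tier split maps naturally onto different lifetimes and a bounded assembly step at prompt time. A minimal sketch in the spirit of Mem0's layout — the field names, assembly order, and budget mechanism here are my assumptions, not Mem0's actual API:

```python
class TieredMemory:
    """Three tiers with different lifetimes: durable user preferences,
    per-session context, and transient agent state."""

    def __init__(self):
        self.user = {}         # long-term: survives across sessions
        self.sessions = {}     # medium-term: keyed by session id
        self.agent_state = {}  # short-term: current interaction only

    def context_for(self, session_id, budget=5):
        """Assemble a small prompt context, most durable facts first,
        truncated to `budget` entries to keep token cost bounded."""
        items = (list(self.user.items())
                 + list(self.sessions.get(session_id, {}).items())
                 + list(self.agent_state.items()))
        return dict(items[:budget])
```

The budget parameter is the whole economic argument in miniature: the model sees a few curated entries, not the full history.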
The common thread across these approaches is treating memory as a core system component rather than an emergent property of larger context. This means explicit lifecycle management: how memories form (extraction, encoding), how they evolve (consolidation, forgetting), and how they're retrieved (query construction, timing, post-processing). It means different storage mechanisms for different memory types: token-level storage, parametric memory encoded in model weights, latent representations in hidden states.
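The "evolve" stage — consolidation and forgetting — is the least intuitive, so here is one hypothetical scoring policy to make it concrete: recency decay weighted by use count. The half-life value and the score formula are illustrative assumptions; production systems typically add relevance and importance terms:

```python
import math

def decay_score(memory, now, half_life=7 * 86400):
    """Use count discounted by exponential recency decay
    (half-life of 7 days, an arbitrary illustrative choice)."""
    age = now - memory["last_used"]
    return memory["uses"] * math.exp(-age * math.log(2) / half_life)

def prune(memories, keep, now):
    """Forgetting step: retain the top-scoring memories, drop the rest."""
    ranked = sorted(memories, key=lambda m: decay_score(m, now), reverse=True)
    return ranked[:keep]
```

Run periodically, a policy like this is what turns "store everything" into a memory system with an actual lifecycle.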
Our read: the field is converging on cognitive science as a design source. Event segmentation, associative linking, tiered retrieval. These aren't arbitrary engineering choices. They're patterns that emerged from how biological memory systems solve the same problems. The practical implication for builders: if you're designing agentic systems, memory needs to be in your architecture from the start. Bolting it on later means fighting the accumulated technical debt of decisions made without memory in mind.
Context vs. Memory (and Where the Weight Goes)
The distinction between "context" and "memory" is getting sharper. Context is what the model can see right now. Memory is how the system decides what's worth putting there.
Production memory layers are likely to become standard infrastructure, the way vector databases did for RAG. The economics are too compelling: 90%+ token savings while maintaining or improving accuracy. The research direction points toward more human-like memory dynamics: surprise-based boundaries, self-organizing associations, graceful forgetting. The goal isn't to remember everything. It's to remember the right things at the right time.
The genuinely open question: how much of this can happen inside model weights versus external systems? Current approaches are mostly external orchestration around frozen models. If future models develop genuine parametric memory, the architecture could shift again entirely.