State Space Models: What Attention Can't Do

State space models like Mamba process sequences in linear time, but they sacrifice perfect recall. That tradeoff is reshaping which architecture gets used where.


Transformers have a scaling problem. Self-attention compares every token to every other token, which means compute grows quadratically with sequence length. Double your context window, quadruple your cost. FlashAttention exists for a reason: clever memory tricks that make long context possible, but the fundamental math remains stubbornly unchanged.

State space models take a different path entirely. Instead of comparing everything to everything, they compress sequence history into a fixed-size hidden state. Process each token, update the state, move on. You get O(n) complexity instead of O(n²). Linear scaling instead of quadratic.
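To make the contrast concrete, here's a minimal sketch of the fixed-state recurrence, with toy dimensions and made-up matrices rather than any real model's parameterization. The point is the shape of the computation: constant work and constant memory per token, no pairwise comparisons.

```python
import numpy as np

def recurrent_scan(xs, A, B, C):
    """Process a sequence in O(n): each token updates a fixed-size
    hidden state h, then the output is read from that state."""
    h = np.zeros(A.shape[0])
    ys = []
    for x in xs:                 # one pass, constant work per token
        h = A @ h + B * x        # fold the new token into compressed history
        ys.append(C @ h)         # read out from the state
    return np.array(ys)

# Toy example: scalar inputs, a 4-dimensional state.
rng = np.random.default_rng(0)
A = 0.9 * np.eye(4)              # decaying memory
B = rng.standard_normal(4)
C = rng.standard_normal(4)
ys = recurrent_scan(rng.standard_normal(1000), A, B, C)
print(ys.shape)  # (1000,)
```

Doubling the sequence length doubles the loop iterations and leaves the state size untouched, which is exactly the linear scaling the paragraph above describes.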

Mamba, released in December 2023, made this practical. According to the original paper, it achieves 5× higher inference throughput than Transformers of equivalent size, and Mamba-3B matches Transformers twice its size on language modeling benchmarks. The efficiency gains hold up to million-length sequences.

So why isn't everything running on SSMs?

SSMs can't remember phone numbers

The tradeoff is fundamental: when you compress an entire conversation into a fixed-size state, you're necessarily losing information. That's fine for many tasks. You don't need perfect recall of every word to follow a narrative or continue a coding pattern. But some tasks demand exact retrieval. What was the phone number mentioned 50,000 tokens ago?

A Technical University of Munich survey documents this clearly: SSMs suffer from finite state capacity constraining information recall. This isn't a bug to fix with better training or more parameters. It's architectural. The Goomba Lab analysis puts it bluntly: Transformers excel at tasks requiring perfect recall and fine-grained token manipulation. SSMs cannot.

The Mamba paper's own benchmarks show the pattern. Strong performance on language modeling perplexity; weaker on in-context learning. The model knows how to continue text but struggles when it needs to retrieve specific information from earlier in the sequence.

From control theory to language

State space models come from control theory, not NLP. The core idea is a continuous-time system that evolves a hidden state based on inputs: the state evolves according to h'(t) = Ah(t) + Bx(t), where A and B are matrices that determine how the current state and input combine. The output y(t) = Ch(t) + Dx(t) reads from that state. The Gradient's explainer walks through the math in detail.
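Before those continuous-time equations can run on token sequences, they have to be discretized. A minimal sketch, assuming a diagonal A (as S4-style models use) and zero-order-hold discretization with step size dt; the numbers are illustrative:

```python
import numpy as np

def discretize_zoh(A, B, dt):
    """Zero-order-hold discretization of h'(t) = A h(t) + B x(t),
    with diagonal A represented as a vector."""
    A_bar = np.exp(dt * A)                 # elementwise exp for diagonal A
    B_bar = (A_bar - 1.0) / A * B          # A^{-1} (exp(dt A) - I) B
    return A_bar, B_bar

def ssm_step(h, x, A_bar, B_bar, C, D):
    """One discrete step: h_k = A_bar h_{k-1} + B_bar x_k; y_k = C h_k + D x_k."""
    h = A_bar * h + B_bar * x
    return h, C @ h + D * x

# Toy 4-dimensional diagonal system driven by a scalar impulse.
A = -np.array([1.0, 2.0, 3.0, 4.0])        # negative: a stable, decaying state
B = np.ones(4); C = np.ones(4); D = 0.0
A_bar, B_bar = discretize_zoh(A, B, dt=0.1)
h = np.zeros(4)
for x in [1.0, 0.0, 0.0]:
    h, y = ssm_step(h, x, A_bar, B_bar, C, D)
print(round(y, 4))
```

The recurrence that falls out is exactly the fixed-state update from earlier: the continuous dynamics just determine what the per-step transition matrices are.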

Traditional SSMs apply identical transformations to every token. Mamba's innovation is making the dynamics input-dependent: the B and C matrices and the discretization step size Δ become functions of the current token, allowing the model to selectively filter information based on what it's seeing. Relevant context gets preserved; irrelevant context gets compressed away. This selective mechanism is what makes Mamba competitive with Transformers on many tasks. It's also what enables the efficiency: because each token only interacts with the compressed state rather than the full history, memory per token stays constant regardless of sequence length.
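A heavily simplified sketch of that selection idea. The projection weights here are made up, and the input is collapsed to a scalar per step (which real Mamba does not do, since it runs an independent SSM per channel), but it shows the mechanism: each token computes its own B, C, and step size, so the update rule itself depends on the content.

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, d_state = 8, 4

# Hypothetical projections producing per-token SSM parameters.
W_B  = rng.standard_normal((d_state, d_model)) * 0.1
W_C  = rng.standard_normal((d_state, d_model)) * 0.1
W_dt = rng.standard_normal(d_model) * 0.1
A = -np.linspace(1.0, d_state, d_state)      # fixed diagonal dynamics

def selective_scan(xs):
    """Each token produces its own B, C and step size dt, so the state
    can choose what to store and what to forget based on the input."""
    h = np.zeros(d_state)
    ys = []
    for x in xs:
        dt = np.log1p(np.exp(W_dt @ x))      # softplus: positive step size
        B, C = W_B @ x, W_C @ x              # input-dependent parameters
        A_bar = np.exp(dt * A)               # per-token discretization
        h = A_bar * h + dt * B * np.mean(x)  # update compressed state
        ys.append(C @ h)
    return np.array(ys)

ys = selective_scan(rng.standard_normal((16, d_model)))
print(ys.shape)  # (16,)
```

A small dt leaves the state nearly untouched (the token is ignored); a large dt overwrites it (the token is remembered). That's the "selective" in selective state space.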

RWKV, another O(n) architecture released in May 2023, takes a different path. It uses a linearized attention form with time-decay factors and gating. The clever bit: it parallelizes like a Transformer during training but runs like an RNN at inference. Best of both worlds. RWKV performs on par with similarly sized Transformers despite the linear complexity.

Everyone converges on hybrids

The most interesting development in SSMs circa 2025 isn't any single model. Multiple research groups have independently converged on the same answer: hybrids.

AI21's Jamba, IBM's Granite 4.0, and NVIDIA's Nemotron 3 all do essentially the same thing: 75-90% SSM layers for local pattern recognition, plus sparse attention layers on top for global reasoning.
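As a sketch of what such a layout looks like in practice. The 7:1 ratio, layer names, and placement rule here are illustrative, not any specific model's published configuration:

```python
def hybrid_layout(n_layers: int, ssm_per_attn: int = 7) -> list[str]:
    """Place one attention layer after every `ssm_per_attn` SSM layers."""
    layout = []
    for i in range(n_layers):
        if (i + 1) % (ssm_per_attn + 1) == 0:
            layout.append("attention")   # sparse layers for global recall
        else:
            layout.append("ssm")         # bulk linear-time processing
    return layout

layout = hybrid_layout(32)
print(layout.count("ssm"), layout.count("attention"))  # 28 4
```

A 32-layer stack at this ratio ends up with just 4 attention layers, which is why the quadratic cost stops dominating even at long context.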

According to AI21, Jamba achieves 3× throughput on long contexts compared to Mixtral 8x7B, while fitting up to 140K context on a single GPU. The Goomba Lab research found optimal hybrid ratios between 3:1 and 10:1 SSM-to-attention layers, a range that multiple research groups have arrived at independently.

When different teams arrive at the same solution, it usually means they've found something real.

Our read: this is exactly what you'd expect from the architectural tradeoffs. SSMs excel at byte-level modeling and high-resolution continuous data (genomics sequences, audio waveforms) where you need to process massive input efficiently. Transformers excel when you need the full context visible simultaneously. Hybrids give you both.

Scale changes everything

At small scales, pure SSMs now beat Transformers. The TUM survey found that at 0.7-1.5B parameters, RWKV-7 and Samba significantly outperform full attention variants.

But at frontier scale, the picture reverses.

No sub-quadratic model cracks the top 10 on the LMSYS leaderboard. Full attention remains central to every model that matters. This suggests SSMs aren't a replacement for Transformers at the frontier; they're a specialization. Better for edge deployment where memory constraints are tight. Better for streaming inference where you can't afford to recompute attention over growing context. Better for domains with inherently local structure that doesn't require global reasoning.

The March 2025 release of RWKV-7 "Goose" added another data point: the paper claims expressivity beyond the TC0 complexity class that theoretically constrains both standard Transformers and earlier linear architectures. Whether that theoretical gap matters in practice is still being tested.

The hybrid architecture pattern is probably here to stay. The open questions are about ratio and integration: how many attention layers, where to place them, how to make them cooperate with SSM layers efficiently. Context windows will keep growing, and as they do, the O(n²) cost of attention becomes increasingly painful. Even with FlashAttention's memory optimizations, you're still paying quadratic compute. SSM layers offer a way to handle the bulk of sequence processing at linear cost, reserving attention for the parts that actually need it.

The real test will be whether hybrid architectures can match pure Transformer performance at frontier scale. So far, they haven't. But the efficiency advantages are compelling enough that research continues. If someone cracks that problem, inference costs drop dramatically.

Understanding SSMs matters because they reveal what attention is actually good at: global reasoning over the full context, perfect recall, fine-grained token manipulation. Everything else might be compressible. How much? That's still an open question.
