Headline: Self-Attention: The Engine Behind Every Frontier AI Model
Every modern AI model you use, from GPT-4 to Claude to Gemini, runs on a single mechanism: attention. Specifically, self-attention, the idea that every token in a sequence should be able to look at every other token and decide what's relevant. It's the load-bearing wall of the entire transformer architecture. And it's simpler than most people think.
The core intuition: a soft lookup table
Strip away the linear algebra and attention is a question-answering system. Each token generates three vectors: a query ("what am I looking for?"), a key ("what do I contain?"), and a value ("what information should I pass along if selected?"). The query from one token is compared against the keys of every other token using a dot product. High dot product means high relevance. Those scores get normalized through softmax to create a probability distribution, and the final output is a weighted blend of all the values.
Jay Alammar's walkthrough of the transformer lays this out clearly: each input embedding (512 dimensions in the original paper) gets multiplied by three learned weight matrices (WQ, WK, WV) to produce query, key, and value vectors of 64 dimensions each. The dot products are divided by √64, the square root of the key dimension, to keep gradients healthy. Without that scaling, dot products in high-dimensional space grow large enough to push softmax into saturation, where nearly all the probability mass concentrates on one token and gradients vanish.
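Here's a minimal NumPy sketch of that mechanism, assuming the dimensions from the original paper (d_model = 512, d_k = 64); the weight matrices are random placeholders standing in for learned parameters, not anyone's actual model weights:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q, K: (seq_len, d_k); V: (seq_len, d_v). Returns (seq_len, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # pairwise relevance, scaled
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ V                                # weighted blend of values

# Toy run: 4 tokens, dimensions from the original paper.
d_model, d_k = 512, 64
rng = np.random.default_rng(0)
X = rng.normal(size=(4, d_model))                     # placeholder token embeddings
W_Q, W_K, W_V = (rng.normal(size=(d_model, d_k)) * 0.02 for _ in range(3))
print(scaled_dot_product_attention(X @ W_Q, X @ W_K, X @ W_V).shape)  # (4, 64)
```

Each row of the softmax weights sums to one, which is exactly the "soft lookup" framing: every output is a probability-weighted blend of the values.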
That's the mechanism running every frontier model on the planet.
What multi-head attention does
One attention pass learns one type of relationship. The original transformer runs eight parallel attention heads, each with its own set of Q/K/V weight matrices. One head might learn syntactic dependencies (subject-verb agreement across a sentence). Another might track coreference (linking "she" to "Dr. Chen" from three paragraphs back). A third might handle positional proximity.
The outputs of all eight heads get concatenated and projected back to the model dimension through a final output weight matrix. This is computationally elegant: instead of one massive attention pass, you get eight smaller, specialized ones that together capture richer patterns than any single pass could.
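Continuing the sketch above, a bare-bones multi-head version looks like this: eight independent sets of Q/K/V projections, concatenated and mapped back to the model dimension by an output matrix (WO in the paper). The weights are again random placeholders, and X is reused from the earlier snippet:

```python
def multi_head_attention(X, n_heads=8, d_k=64, seed=1):
    """X: (seq_len, d_model). Each head gets its own Q/K/V projections."""
    rng = np.random.default_rng(seed)
    d_model = X.shape[-1]
    head_outputs = []
    for _ in range(n_heads):
        W_Q, W_K, W_V = (rng.normal(size=(d_model, d_k)) * 0.02 for _ in range(3))
        head_outputs.append(scaled_dot_product_attention(X @ W_Q, X @ W_K, X @ W_V))
    concat = np.concatenate(head_outputs, axis=-1)    # (seq_len, n_heads * d_k)
    W_O = rng.normal(size=(n_heads * d_k, d_model)) * 0.02
    return concat @ W_O                               # back to (seq_len, d_model)

print(multi_head_attention(X).shape)  # (4, 512)
```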
Modern models scale this further. GPT-class models use dozens of heads per layer across dozens of layers. But the principle is identical to the 2017 original.
Why attention replaced RNNs
Before transformers, sequence modeling meant recurrent neural networks. LSTMs, GRUs, and their variants processed tokens one at a time, threading a hidden state from each step to the next. This worked, but it had a fatal flaw: you couldn't parallelize it. Each hidden state depended on the previous one, which meant training was inherently sequential.
Attention threw that away. Because every token's query is compared against every key simultaneously, the entire sequence can be processed in parallel on GPUs. Training speedups were immediate and dramatic. The same hardware that took weeks to train an RNN could train a transformer in days.
The architecture also splits into two flavors based on which tokens can see which others. Encoder models (like BERT) use bidirectional attention, where every token attends to every other token in both directions. Decoder models (like GPT) use causal masking, where each token can only attend to tokens that came before it. This masking is what makes autoregressive generation work; the model can't cheat by looking ahead at the tokens it's supposed to predict.
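To make the mask concrete, here's a causal variant of the earlier NumPy sketch: positions above the diagonal of the score matrix get −∞ before the softmax, so future tokens receive exactly zero weight (the helper name is ours):

```python
def causal_attention(Q, K, V):
    """Decoder-style attention: token i may only attend to tokens 0..i."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    n = scores.shape[0]
    future = np.triu(np.ones((n, n), dtype=bool), k=1)  # True above the diagonal
    scores = np.where(future, -np.inf, scores)          # mask out future positions
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)      # rows still sum to 1
    return weights @ V                                  # exp(-inf) = 0: zero weight
```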
Our take: the parallelization advantage wasn't just a training speedup. It fundamentally changed the economics of scaling. RNNs hit a wall where throwing more hardware at them yielded diminishing returns. Transformers turned compute into capability almost linearly, which is why the scaling laws that define the current era of AI are built on attention.
Quadratic scaling as the bottleneck
Attention computes pairwise interactions between every token in the sequence. For a sequence of length n, that's an n×n matrix at every layer. The cost is O(n²) in both compute and memory. When context windows were 2,048 tokens, this was manageable. At 100,000+ tokens, it's a genuine bottleneck.
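A quick back-of-envelope calculation shows why. The numbers below assume fp16 scores and count only a single n×n matrix; real kernels differ, but the quadratic growth is the point:

```python
def attention_matrix_gib(n_tokens, bytes_per_score=2):
    """Memory for one n x n score matrix, fp16 assumed (2 bytes per score)."""
    return n_tokens ** 2 * bytes_per_score / 2 ** 30

for n in (2_048, 32_768, 100_000):
    print(f"{n:>7} tokens -> {attention_matrix_gib(n):8.2f} GiB per head, per layer")
# 2,048 tokens -> ~0.01 GiB; 100,000 tokens -> ~18.6 GiB -- for one head of one layer.
```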
FlashAttention attacked this at the GPU kernel level: by tiling the computation through fast on-chip SRAM, it avoids materializing the full n×n matrix in high-bandwidth memory at all. The result was 2-4x speedups in practice. (A separate long-context cost is the KV cache, which stores key and value vectors for previously processed tokens and grows linearly with context length.) But FlashAttention is an optimization, not a solution; the compute is still fundamentally quadratic.
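For readers unfamiliar with the KV cache mentioned above, here's a toy version: during autoregressive decoding, each new token's key and value are appended so earlier tokens never need to be re-encoded. This is a conceptual sketch of the cache itself, not of FlashAttention's tiling (it reuses the NumPy import from earlier):

```python
class KVCache:
    """Toy KV cache: grows by one (key, value) pair per generated token."""
    def __init__(self):
        self.keys, self.values = [], []

    def append(self, k, v):
        self.keys.append(k)
        self.values.append(v)

    def attend(self, q):
        """Attend the newest query over everything cached so far."""
        K = np.stack(self.keys)                    # (cached_len, d_k)
        V = np.stack(self.values)                  # (cached_len, d_v)
        scores = K @ q / np.sqrt(q.shape[-1])      # (cached_len,)
        w = np.exp(scores - scores.max())
        w /= w.sum()                               # softmax over cached positions
        return w @ V                               # context vector for the new token
```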
This created a real opening for alternatives.
The challengers: sub-quadratic architectures
Mamba (based on state-space models), RWKV, RetNet, and various linear attention schemes all promise O(n) scaling. Process tokens in linear time, keep constant memory per token, and ditch the quadratic matrix entirely. On paper, this is exactly what the field needs for million-token context windows.
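To see where the O(n) claim comes from, here's a minimal sketch of the generic linear-attention trick: swap softmax for a positive feature map φ so a key-value summary can be computed once and reused, with no n×n score matrix ever built. This illustrates the family of ideas, not the specific formulations used by Mamba, RWKV, or RetNet; φ = elu(x) + 1 is one common choice:

```python
def linear_attention(Q, K, V, eps=1e-6):
    """Non-causal linear attention: softmax(QK^T)V is replaced by
    phi(Q) @ (phi(K)^T V), normalized -- O(n) in sequence length."""
    def phi(x):                                   # positive feature map: elu(x) + 1
        return np.where(x > 0, x + 1.0, np.exp(x))
    Qf, Kf = phi(Q), phi(K)
    kv = Kf.T @ V                                 # (d_k, d_v) summary, computed once
    z = Kf.sum(axis=0)                            # (d_k,) normalizer
    return (Qf @ kv) / ((Qf @ z)[:, None] + eps)  # (seq_len, d_v)
```

The causal version turns that summary into a running sum updated token by token, which is what gives constant memory per generated token.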
The arXiv survey on sub-quadratic architectures tells a more complicated story. At small scale (under 1.5B parameters), these alternatives genuinely work. RWKV7-World3 at 1.5B parameters scored 43.3 on MMLU and 78.1 on ARC-E, significantly outperforming Llama 3.2 1B's 32.1 and 67.0 on the same benchmarks. Samba showed similar advantages in this weight class.
But at frontier scale, the picture changes completely. No pure sub-quadratic model has cracked the top 10 on LMSYS Chatbot Arena. As you move into the 14B-70B range, pure alternatives vanish from leaderboards entirely. Only hybrids like Jamba (52B) remain competitive, and even those don't surpass well-optimized transformers.
The most telling data point: MiniMax built their M2 model on linear attention, then reportedly reverted to full attention after production quality degraded. When real users and real workloads are on the line, full attention keeps winning despite its quadratic cost.
The emerging compromise: hybrid architectures
The field is converging on a pragmatic middle ground. Models like Qwen3-Next and Kimi Linear use hybrid designs with roughly 3:1 ratios of linear-to-full attention layers. Most of the network runs cheap linear attention for local pattern matching, but a few full attention layers serve as "anchors" for tasks requiring exact recall over long contexts.
The survey's analysis identifies two hybrid patterns gaining traction: striped architectures (alternating between attention types layer by layer) and fusion architectures (running different attention types in parallel within a single layer). Both segment memory into specialized components that handle different temporal scales.
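As a rough illustration of the striped pattern, a 3:1 hybrid stack might interleave layer types like this (the layer count and the every-fourth placement are hypothetical, not taken from any specific model):

```python
def striped_layer_plan(n_layers=48, full_every=4):
    """Every 4th layer uses full attention; the rest use a linear-attention variant."""
    return ["full" if (i + 1) % full_every == 0 else "linear" for i in range(n_layers)]

plan = striped_layer_plan()
print(plan[:8])                                       # ['linear', 'linear', 'linear', 'full', ...]
print(plan.count("linear"), ":", plan.count("full"))  # 36 : 12, i.e. a 3:1 ratio
```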
Our read: this is what "replacing attention" actually looks like. Not a clean swap, but a gradual reduction. Keep full attention where it matters (associative recall, long-range dependencies) and use cheaper primitives everywhere else. The question isn't whether attention will be dethroned. It's how much of it you can strip out before accuracy degrades.
Where this is heading
Attention's position is paradoxical. It's simultaneously the most important and most expensive idea in modern AI. Every serious attempt to replace it has ended up compromising with it instead. The sub-quadratic alternatives aren't failures; they're genuine advances for edge deployment and resource-constrained settings. But at the frontier, where the best models are trained, full attention remains the foundation.
The real shift happening now isn't architectural; it's economic. As context windows push toward millions of tokens and inference costs dominate deployment budgets, the pressure to reduce attention's footprint will only grow. Hybrid architectures are the first credible response. They won't make headlines the way "transformer killer" papers do, but they're the ones actually shipping in production.
Attention isn't going anywhere. The interesting question is how little of it turns out to be enough.