Transformers: The Architecture That Made AI Work

The transformer is the architecture behind every frontier AI model you use. GPT-4, Claude, Gemini, Llama: all transformers. It's not a marketing term or a vague category. It's a specific design for neural networks, introduced in a 2017 paper called "Attention Is All You Need," that solved the fundamental problem of how to process language efficiently.


Before transformers, the dominant approach was recurrent neural networks. RNNs and LSTMs processed text sequentially, one word at a time, passing a hidden state forward through the sequence. Training couldn't be parallelized because each step depended on the previous one. The hidden state suffered from vanishing gradients—information from early in a long sequence faded by the end.

The transformer architecture processes all words simultaneously. Every word examines every other word in parallel. Massively parallel, vastly better at long-range dependencies, and quadratically expensive in sequence length (an O(n²) cost that becomes the bottleneck for extremely long contexts).

But this creates a problem sequential models never had: when you process everything in parallel, position disappears.

Sequential processing has one advantage: position is implicit. Word five comes after word four because that's the order you processed them in. Transformers see a bag of words with no inherent order.

Transformers solve this by injecting position information directly into the data itself. Before any attention computation happens, each word embedding gets added to a positional encoding—a vector representing where in the sequence that word appears.

The original design used sinusoidal functions for this. Different frequency waves create patterns that work like binary numbers: some dimensions change rapidly across positions, others change slowly. According to analysis of positional encoding schemes, this lets the model learn relative positions, not just absolute ones; it can recognize "these two words are three positions apart" regardless of where they appear in the sequence.
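To make the binary-number analogy concrete, here is a minimal NumPy sketch of the sinusoidal scheme from the original paper (the function name and shapes are my own; real implementations add these encodings to the token embeddings):

```python
import numpy as np

def sinusoidal_encoding(seq_len, d_model):
    """Sinusoidal positional encodings: even dimensions use sin,
    odd dimensions use cos, with geometrically spaced frequencies
    (fast-changing low dims, slow-changing high dims)."""
    positions = np.arange(seq_len)[:, None]            # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]           # (1, d_model/2)
    angles = positions / (10000.0 ** (dims / d_model)) # (seq_len, d_model/2)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe
```

Because each pair of dimensions is a sinusoid at a fixed frequency, the encoding at position p+k is a fixed linear function of the encoding at position p, which is what lets the model pick up relative offsets.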

This design choice moved the burden of understanding order from the structure of the neural network to the data itself.

Later models such as BERT and GPT-2 used learned positional encodings instead. Same principle, but the model learns the representation rather than relying on predetermined functions. Rotary positional encoding (RoPE), used in Llama and many recent models, goes further by encoding relative position directly into the attention computation itself.
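RoPE's trick is to rotate each pair of dimensions in the query and key vectors by an angle proportional to the token's position, so the dot product between a rotated query and key depends only on their relative offset. A simplified NumPy sketch (my own function name; real implementations apply this per head, to queries and keys only):

```python
import numpy as np

def apply_rope(x):
    """Rotate each (even, odd) dimension pair of x by a
    position-dependent angle. x has shape (seq_len, d_model)."""
    seq_len, d = x.shape
    pos = np.arange(seq_len)[:, None]              # (seq_len, 1)
    freqs = 10000.0 ** (-np.arange(0, d, 2) / d)   # (d/2,) geometric frequencies
    theta = pos * freqs                            # (seq_len, d/2) rotation angles
    cos, sin = np.cos(theta), np.sin(theta)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin             # standard 2-D rotation
    out[:, 1::2] = x1 * sin + x2 * cos
    return out
```

Rotation preserves vector norms, so RoPE changes only the angular relationship between queries and keys, not their magnitudes.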

Self-Attention: The Core Mechanism

With position encoded, the transformer can focus on the real work: figuring out which words should influence which other words through self-attention.

We've covered the query-key-value mechanism in depth elsewhere; in brief: each word generates three vectors. The query asks "what am I looking for?" The key describes "what do I contain?" And the value holds the actual information to pass along if that word is deemed relevant.

The attention score between two words is the dot product of one's query with the other's key, scaled down by the square root of the key dimension so the softmax doesn't saturate. High score means high relevance. These scores run through softmax to create a probability distribution, and the output for each word is a weighted blend of all the value vectors.
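The whole mechanism fits in a few lines. A minimal NumPy sketch of scaled dot-product attention (single head, no masking; names are my own):

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: scores = QK^T / sqrt(d_k),
    softmax over the keys, then a weighted blend of the values."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # (n_q, n_k) relevance scores
    scores -= scores.max(axis=-1, keepdims=True)    # shift for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax: rows sum to 1
    return weights @ V, weights                     # blended values + the weights
```

Each output row is a convex combination of the value vectors, with the softmax weights deciding how much each word contributes.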

Multi-head attention runs this process multiple times in parallel. GPT-3 uses 96 attention heads per layer, each learning different relationship types. One head might track subject-verb agreement across a sentence. Another might link pronouns to their referents. A third might capture semantic similarity.

The outputs get concatenated and projected back to the model's working dimension.

This is computationally elegant: instead of one massive attention pass, you get many smaller specialized ones that together capture richer patterns than any single computation could.
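The split-attend-concatenate-project pattern can be sketched as follows (a simplified single-layer version in NumPy; the weight matrices and loop-over-heads structure are my own, and production code fuses the heads into batched tensor ops):

```python
import numpy as np

def multi_head_attention(x, W_q, W_k, W_v, W_o, n_heads):
    """Split d_model into n_heads smaller subspaces, run scaled
    dot-product attention in each independently, then concatenate
    the head outputs and project back to d_model with W_o."""
    seq_len, d_model = x.shape
    d_head = d_model // n_heads
    Q, K, V = x @ W_q, x @ W_k, x @ W_v
    heads = []
    for h in range(n_heads):
        s = slice(h * d_head, (h + 1) * d_head)     # this head's slice of dims
        scores = Q[:, s] @ K[:, s].T / np.sqrt(d_head)
        scores -= scores.max(axis=-1, keepdims=True)
        w = np.exp(scores)
        w /= w.sum(axis=-1, keepdims=True)          # per-head softmax
        heads.append(w @ V[:, s])
    return np.concatenate(heads, axis=-1) @ W_o     # concat, project to d_model
```

Note that the total compute is roughly the same as one full-width attention pass; the heads just partition the dimensions into independent subspaces.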

What Happens Inside Each Block

A transformer layer isn't just attention. Each block has four key operations, applied in sequence with residual connections wrapping around them:

Layer normalization stabilizes the distribution of values at each step, preventing the wild swings that can destabilize deep networks.

Self-attention (as described above) lets each position gather information from every other position.

Another round of layer normalization.

A feed-forward network operates on each position independently. This is where pattern matching and feature extraction happen. The original transformer expanded each position's representation from 512 to 2,048 dimensions in this layer, applied a nonlinearity, and projected back down; modern models go wider.

This expansion-contraction pattern gives the network space to detect complex patterns before compressing back to model dimension.

The residual connections are crucial. They let gradients flow directly through the network during training, making it possible to stack dozens or hundreds of layers without the signal degrading completely.
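Putting the four operations together, a pre-norm block looks roughly like this (a NumPy sketch under simplifying assumptions: `attn_fn` is a placeholder for any self-attention function, and layer norm's learnable scale and shift are omitted):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each position's vector to zero mean, unit variance.
    Learnable gain/bias omitted for brevity."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def transformer_block(x, attn_fn, W1, b1, W2, b2):
    """One pre-norm block: x + Attn(LN(x)), then x + FFN(LN(x)).
    The two residual additions are the gradient highways."""
    x = x + attn_fn(layer_norm(x))                # attention sublayer + residual
    h = np.maximum(0, layer_norm(x) @ W1 + b1)    # expand (e.g. 512 -> 2048), ReLU
    x = x + h @ W2 + b2                           # contract back + residual
    return x
```

Because each sublayer computes an additive update to `x` rather than replacing it, the identity path survives through every block, which is what keeps gradients healthy in very deep stacks.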

Encoders and Decoders

The original transformer paper proposed an encoder-decoder architecture for translation: the encoder processes the source language bidirectionally (every word attends to every other word), and the decoder generates the target language autoregressively (each word can only attend to previous words).

Modern language models mostly dropped the encoder. GPT-style models are decoder-only: they predict the next token based on all previous tokens, using causal masking to prevent cheating by looking ahead. BERT went the other direction, encoder-only with bidirectional attention, optimized for understanding rather than generation. T5 and similar models kept the full encoder-decoder structure for tasks like summarization and translation.
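Causal masking is a one-line change to the attention computation: scores for future positions are set to negative infinity before the softmax, so they receive exactly zero weight. A NumPy sketch (names are my own):

```python
import numpy as np

def causal_attention(Q, K, V):
    """Decoder-style self-attention: position i may attend only to
    positions <= i, enforced by masking future scores to -inf."""
    n, d_k = Q.shape
    scores = Q @ K.T / np.sqrt(d_k)
    future = np.triu(np.ones((n, n), dtype=bool), k=1)  # True above the diagonal
    scores = np.where(future, -np.inf, scores)          # exp(-inf) -> weight 0
    scores -= scores.max(axis=-1, keepdims=True)        # stable: diagonal is finite
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V, w
```

Every row of the resulting weight matrix is zero above the diagonal, so the first token can only attend to itself and later tokens see strictly their past.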

The decoder-only design won for generative AI because it's simpler and scales better. You just keep predicting the next token, over and over, and the same architecture handles both training and inference.

Why Transformers Enabled the Scaling Era

RNNs couldn't parallelize, so training was slow. Transformers process sequences in parallel, so training got fast. The original paper achieved state-of-the-art translation with 3.5 days of training on 8 GPUs. That was remarkable for 2017.

The deeper impact: scaling laws. Because transformers parallelize so well, researchers discovered that model quality improves predictably as you add parameters and data. You can forecast performance using a fraction of eventual compute.

That predictability is why OpenAI could plan GPT-4 years in advance and why the industry poured billions into training runs.

The architecture is also simple enough to scale without breaking. Add more layers, wider layers, more attention heads. The fundamental operations stay the same.

Architectures with complex recurrent dynamics don't have that property.

Our take: the transformer didn't just work better than what came before; it worked better in ways that compounded. Better parallelization meant faster training. Faster training meant more experiments. More experiments meant better understanding of scaling. Better understanding meant bigger bets. The architecture created a flywheel that's still spinning.

Seven years later, every frontier model remains a transformer or a transformer hybrid. Competitors like Mamba and RWKV have carved out niches for efficiency-sensitive deployments, but no pure alternative has cracked the top of capability rankings. The transformer's dominance isn't just momentum. It's earned.

Frequently Asked Questions