Transformer

A neural network architecture based on self-attention that processes all tokens in parallel, replacing sequential RNNs and enabling the current generation of large language models.

Introduced in the 2017 paper "Attention Is All You Need," the transformer architecture uses self-attention mechanisms to model relationships between all positions in a sequence simultaneously. This parallel processing capability, combined with multi-head attention and feed-forward layers, enabled dramatic training speedups over RNNs and became the foundation for GPT, BERT, Claude, and virtually all modern language models.
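The core operation described above, scaled dot-product self-attention, can be sketched in a few lines of NumPy. This is a minimal illustration (single head, no learned projection matrices, no masking), not the full multi-head architecture from the paper:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Q, K, V: (seq_len, d_k) arrays of query, key, and value vectors.
    d_k = Q.shape[-1]
    # Pairwise attention scores between every position and every other
    # position, scaled by sqrt(d_k) as in "Attention Is All You Need".
    scores = Q @ K.T / np.sqrt(d_k)
    # Row-wise softmax turns scores into attention weights that sum to 1.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Each output is a weighted average over all value vectors, computed
    # for every position at once -- no sequential recurrence as in RNNs.
    return weights @ V

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))   # 4 tokens, 8-dimensional embeddings
out = scaled_dot_product_attention(x, x, x)  # self-attention: Q = K = V = x
print(out.shape)  # (4, 8)
```

Because the matrix products attend to all positions simultaneously, the whole sequence is processed in parallel, which is the property that made transformers so much faster to train than RNNs.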

Also known as

transformer architecture, transformer model, transformer neural network, attention-based model