Self-Attention

A mechanism where every token in a sequence computes relevance scores against every other token, allowing the model to weigh which parts of the input matter most for each position.

Self-attention is the core operation inside transformer models. Each token produces a query, key, and value vector. A token's query is compared against all keys via dot products to produce attention scores, which are scaled by the square root of the key dimension, normalized through a softmax, and used to form a weighted sum of the value vectors. This lets each token dynamically focus on the most relevant parts of the input sequence, regardless of distance.
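
The following is a minimal sketch of this computation in NumPy for a single sequence; the shapes, variable names, and random projection matrices are illustrative assumptions, not taken from any particular library or model.

```python
# Minimal scaled dot-product self-attention sketch (illustrative, not a library API).
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    # X: (seq_len, d_model) token embeddings; W_q, W_k, W_v: (d_model, d_k) projections.
    Q = X @ W_q                              # one query vector per token
    K = X @ W_k                              # one key vector per token
    V = X @ W_v                              # one value vector per token
    d_k = Q.shape[-1]
    # Compare every query with every key; scale by sqrt(d_k) so the
    # dot products do not grow with the key dimension.
    scores = Q @ K.T / np.sqrt(d_k)          # (seq_len, seq_len)
    weights = softmax(scores, axis=-1)       # each row sums to 1
    # Weighted sum of values: each position mixes information from
    # every other position according to its attention weights.
    return weights @ V                       # (seq_len, d_k)

# Example usage with random data (hypothetical sizes).
rng = np.random.default_rng(0)
seq_len, d_model, d_k = 4, 8, 8
X = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))
out = self_attention(X, W_q, W_k, W_v)
print(out.shape)  # (4, 8): one output vector per input token
```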

Also known as

scaled dot-product attention, attention mechanism