Self-Attention

A mechanism where each token in a sequence computes attention weights over all tokens in the sequence (including itself), enabling the model to capture dependencies regardless of distance.

Self-attention works by having each token generate query, key, and value vectors. The query from one token is compared against every key via dot product to produce attention scores, which are scaled by the square root of the key dimension and normalized through softmax; the resulting weights form a weighted combination of the value vectors. Because every position can directly attend to every other position, self-attention is the core computational primitive of transformer models.
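The steps above can be sketched in a few lines of NumPy. This is a minimal, single-head illustration, not a production implementation; the projection matrices `Wq`, `Wk`, `Wv` and the helper name `self_attention` are illustrative assumptions.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention (illustrative sketch).

    X: (seq_len, d_model) token embeddings
    Wq, Wk, Wv: (d_model, d_k) projection matrices
    """
    # Each token generates query, key, and value vectors
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = K.shape[-1]
    # Dot-product scores between every query and every key,
    # scaled by sqrt(d_k) to keep magnitudes stable
    scores = Q @ K.T / np.sqrt(d_k)
    # Softmax over keys: each row becomes a distribution over positions
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Weighted combination of value vectors
    return weights @ V

# Toy example: 4 tokens, model dimension 8
rng = np.random.default_rng(0)
seq_len, d_model = 4, 8
X = rng.normal(size=(seq_len, d_model))
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))

out = self_attention(X, Wq, Wk, Wv)
print(out.shape)  # one output vector per input position
```

Note that the output has the same shape as the input: every position receives a new representation that mixes information from all positions.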

Also known as

self attention, intra-attention; often implemented via scaled dot-product attention, the scoring function used inside transformer attention layers