Multi-Head Attention
A technique that runs multiple self-attention operations in parallel, each learning different types of relationships between tokens.
Instead of performing a single attention pass, multi-head attention splits the model's embedding dimension across multiple "heads," each with its own query, key, and value projection matrices. One head might learn syntactic patterns while another tracks coreference. The per-head outputs are concatenated and projected back to the model dimension, enabling richer pattern capture than single-head attention.
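The mechanism can be sketched in a few lines of NumPy. This is a minimal illustration, not a production implementation: the function name, weight-matrix arguments, and dimensions below are assumptions chosen for clarity, and batching, masking, and dropout are omitted.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the last axis
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(x, Wq, Wk, Wv, Wo, num_heads):
    # x: (seq_len, d_model); W*: (d_model, d_model) -- illustrative shapes
    seq_len, d_model = x.shape
    d_head = d_model // num_heads
    # Project once, then split the model dimension across heads
    q = (x @ Wq).reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)
    k = (x @ Wk).reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)
    v = (x @ Wv).reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)
    # Scaled dot-product attention, computed independently per head
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)  # (heads, seq, seq)
    weights = softmax(scores)
    heads = weights @ v                                   # (heads, seq, d_head)
    # Concatenate heads and project back to the model dimension
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ Wo

rng = np.random.default_rng(0)
d_model, num_heads, seq_len = 8, 2, 4
Wq, Wk, Wv, Wo = (rng.standard_normal((d_model, d_model)) for _ in range(4))
x = rng.standard_normal((seq_len, d_model))
out = multi_head_attention(x, Wq, Wk, Wv, Wo, num_heads)
print(out.shape)  # (4, 8): output has the same shape as the input
```

Note that the total compute is comparable to a single full-dimension attention pass: each head attends over the full sequence but in a smaller subspace of size d_model / num_heads.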
Also known as
MHA, multi head attention, parallel attention heads