Multi-Head Attention
Running multiple attention operations in parallel, each with its own learned weight matrices, so the model can capture different types of relationships simultaneously.
Multi-head attention splits the attention computation into multiple parallel heads (8 in the original transformer), each with its own Q/K/V weight matrices projecting into a lower-dimensional subspace (d_model / h per head). Different heads learn to focus on different relationship types: syntactic structure, coreference, positional proximity, and more. Their outputs are concatenated and projected back to the model dimension, giving the model richer representational capacity than a single attention pass at roughly the same computational cost.
Also known as
MHA
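
A minimal PyTorch sketch of the mechanism described above, assuming d_model=512 and 8 heads as in the original transformer; the class name MultiHeadAttention and the projection names w_q, w_k, w_v, w_o are illustrative, not a specific library API.

import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model=512, num_heads=8):
        super().__init__()
        assert d_model % num_heads == 0
        self.num_heads = num_heads
        self.d_head = d_model // num_heads
        # Independent learned projections for queries, keys, and values
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        # Output projection back to the model dimension
        self.w_o = nn.Linear(d_model, d_model)

    def forward(self, x):
        batch, seq_len, d_model = x.shape

        # Project, then split the last dimension into heads:
        # (batch, seq_len, d_model) -> (batch, num_heads, seq_len, d_head)
        def split_heads(t):
            return t.view(batch, seq_len, self.num_heads, self.d_head).transpose(1, 2)

        q = split_heads(self.w_q(x))
        k = split_heads(self.w_k(x))
        v = split_heads(self.w_v(x))

        # Scaled dot-product attention, computed independently per head
        scores = q @ k.transpose(-2, -1) / (self.d_head ** 0.5)
        weights = F.softmax(scores, dim=-1)
        out = weights @ v  # (batch, num_heads, seq_len, d_head)

        # Concatenate heads and project back to the model dimension
        out = out.transpose(1, 2).contiguous().view(batch, seq_len, d_model)
        return self.w_o(out)

# Usage: 8 heads over a 512-dimensional model
mha = MultiHeadAttention(d_model=512, num_heads=8)
x = torch.randn(2, 10, 512)   # (batch, sequence length, model dimension)
print(mha(x).shape)           # torch.Size([2, 10, 512])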