Multi-Query Attention

An attention mechanism variant in which all query heads share a single key head and a single value head, providing the maximum key-value (KV) cache reduction at the cost of some model quality.

Multi-query attention (MQA) is a memory optimization for transformer models proposed by Noam Shazeer in the 2019 paper "Fast Transformer Decoding: One Write-Head Is All You Need." In standard multi-head attention (MHA), each attention head has its own query, key, and value projections. MQA keeps per-head queries but collapses the keys and values, so all query heads attend over a single shared key head and value head. This shrinks the KV cache by a factor equal to the number of heads (e.g., 32x for a 32-head model). The tradeoff is reduced expressiveness, since every query head must work with the same key-value representation. MQA was largely superseded by grouped-query attention (GQA), which offers a tunable middle ground between MQA's aggressive memory savings and standard MHA's full expressiveness.
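The sharing scheme can be made concrete with a minimal NumPy sketch (an illustration, not any particular library's implementation; the weight shapes and function name are hypothetical). Note that there are several query projections but only one key projection and one value projection, whose outputs are what a decoder would cache:

```python
import numpy as np

def multi_query_attention(x, w_q, w_k, w_v, n_heads):
    """Sketch of multi-query attention: n_heads query heads,
    one shared key head and one shared value head."""
    seq, d_model = x.shape
    d_head = d_model // n_heads
    # Per-head queries: (n_heads, seq, d_head)
    q = (x @ w_q).reshape(seq, n_heads, d_head).transpose(1, 0, 2)
    # Single shared key/value projections: (seq, d_head).
    # In MHA these would be (n_heads, seq, d_head) -- the KV cache
    # is therefore n_heads times smaller here.
    k = x @ w_k
    v = x @ w_v
    # Every query head attends against the same K (broadcast over heads)
    scores = q @ k.T / np.sqrt(d_head)            # (n_heads, seq, seq)
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)     # softmax over keys
    out = weights @ v                             # (n_heads, seq, d_head)
    return out.transpose(1, 0, 2).reshape(seq, d_model)

rng = np.random.default_rng(0)
d_model, n_heads, seq = 8, 4, 5
x = rng.standard_normal((seq, d_model))
w_q = rng.standard_normal((d_model, d_model))
w_k = rng.standard_normal((d_model, d_model // n_heads))  # one key head
w_v = rng.standard_normal((d_model, d_model // n_heads))  # one value head
y = multi_query_attention(x, w_q, w_k, w_v, n_heads)
print(y.shape)  # (5, 8)
```

The memory saving comes entirely from `w_k` and `w_v` producing one head's worth of output instead of `n_heads`: during autoregressive decoding, only `k` and `v` are cached per layer, so the cache is `n_heads` times smaller than MHA's.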

Also known as

MQA, multi-query attention, shared-KV attention