Grouped-Query Attention

An attention mechanism variant that shares key-value heads across multiple query heads, reducing KV cache memory requirements while maintaining model quality.

Grouped-query attention (GQA) is an architectural optimization that reduces the memory footprint of the KV cache by having groups of query heads share a single key-value head. The cache shrinks by the ratio of query heads to key-value heads, often 5x or more, enabling larger batch sizes and longer context lengths within the same memory budget. Benchmarks show GQA can quadruple throughput relative to standard multi-head attention, making it a key technique for cost-effective LLM serving at scale.
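The mechanism can be illustrated with a minimal sketch: each key-value head is repeated across its group of query heads before standard scaled dot-product attention is applied. This is a simplified single-batch NumPy illustration, not a production implementation; the function name, tensor shapes, and the absence of causal masking are all assumptions for clarity.

```python
import numpy as np

def grouped_query_attention(q, k, v, n_kv_heads):
    """Simplified GQA sketch (no batching, no masking).

    q: (n_q_heads, seq_len, head_dim)
    k, v: (n_kv_heads, seq_len, head_dim), where n_kv_heads divides n_q_heads.
    Only k and v need to be cached during decoding, so the KV cache is
    n_q_heads / n_kv_heads times smaller than with multi-head attention.
    """
    n_q_heads, seq_len, head_dim = q.shape
    group_size = n_q_heads // n_kv_heads

    # Broadcast each KV head to the query heads in its group.
    k = np.repeat(k, group_size, axis=0)  # (n_q_heads, seq_len, head_dim)
    v = np.repeat(v, group_size, axis=0)

    # Standard scaled dot-product attention per query head.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(head_dim)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v  # (n_q_heads, seq_len, head_dim)

# Example: 8 query heads sharing 2 KV heads -> 4x smaller KV cache.
rng = np.random.default_rng(0)
q = rng.standard_normal((8, 5, 16))
k = rng.standard_normal((2, 5, 16))
v = rng.standard_normal((2, 5, 16))
out = grouped_query_attention(q, k, v, n_kv_heads=2)
```

Setting `n_kv_heads` equal to the number of query heads recovers standard multi-head attention, and setting it to 1 recovers multi-query attention; GQA interpolates between the two.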

Also known as

GQA, grouped query attention