Grouped-Query Attention
An attention mechanism that reduces KV cache size by sharing key-value heads across multiple query heads, improving inference throughput.
Grouped-query attention (GQA) is a modification of standard multi-head attention in which the query heads are divided into groups, and all query heads in a group share a single key head and value head. It sits between multi-head attention (one key-value head per query head) and multi-query attention (one key-value head shared by all query heads). Because only the shared key and value projections need to be cached per token, the KV cache memory footprint shrinks in proportion to the grouping ratio, allowing larger batch sizes and higher throughput during inference. Benchmarks show that a 5x KV cache reduction can improve serving throughput by roughly 4x.
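The sketch below is a minimal NumPy illustration of the idea, not a reference implementation; the function and parameter names (grouped_query_attention, n_query_heads, n_kv_heads) are assumptions for this example. It shows 8 query heads attending with only 2 key-value heads: each cached K/V head is broadcast across its group of query heads before ordinary scaled dot-product attention is applied.

```python
# Minimal sketch of grouped-query attention with NumPy (illustrative only).
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def grouped_query_attention(q, k, v, n_query_heads, n_kv_heads):
    """q: (seq, n_query_heads, d); k, v: (seq, n_kv_heads, d).
    Each group of n_query_heads // n_kv_heads query heads shares one K/V head."""
    group_size = n_query_heads // n_kv_heads
    d = q.shape[-1]
    # Broadcast each cached K/V head to every query head in its group.
    k = np.repeat(k, group_size, axis=1)  # (seq, n_query_heads, d)
    v = np.repeat(v, group_size, axis=1)
    # Standard scaled dot-product attention, computed per head.
    scores = np.einsum("qhd,khd->hqk", q, k) / np.sqrt(d)
    weights = softmax(scores, axis=-1)
    return np.einsum("hqk,khd->qhd", weights, v)

# Example: 8 query heads share 2 KV heads, a 4x smaller KV cache than
# standard multi-head attention with 8 KV heads.
seq, d = 16, 64
q = np.random.randn(seq, 8, d)
k = np.random.randn(seq, 2, d)
v = np.random.randn(seq, 2, d)
out = grouped_query_attention(q, k, v, n_query_heads=8, n_kv_heads=2)
print(out.shape)  # (16, 8, 64)
```

In a real serving stack, only the n_kv_heads key and value tensors are written to the KV cache; the broadcast step is typically fused into the attention kernel rather than materialized as it is here.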
Also known as
GQA