GQA and MQA: How Modern LLMs Cut Memory in Half

Grouped-query attention and multi-query attention shrink the KV cache by sharing key-value heads across query heads. GQA with 8 groups now powers Llama, Mistral, and nearly every production model.

Tags: Research, transformers, optimization, inference, memory


Every time you use a large language model, something expensive is happening behind the scenes: the model stores key and value tensors for every token it has seen. This is the KV cache, and for a 70-billion-parameter model processing a long conversation, it can consume tens of gigabytes of GPU memory.

Grouped-query attention (GQA) and multi-query attention (MQA) make this manageable. If you've used Llama 2, Llama 3, Mistral, or Mixtral, you've already used GQA. It's in virtually every production model shipped in the last two years.

Inference is memory-bound, not compute-bound

Most discussions of transformer efficiency focus on compute. How many FLOPs does attention require? How does that n×n matrix scale with sequence length? But during autoregressive generation (the token-by-token output you see from a chatbot), the bottleneck shifts. The model isn't waiting on matrix multiplication. It's waiting on memory bandwidth.

The original MQA paper by Noam Shazeer made this observation explicit: standard attention is bottlenecked by memory bandwidth, not compute. The GPU can multiply matrices faster than it can load them from memory.

That changes everything about where to optimize.

The KV cache formula is straightforward: layers × batch size × heads × head dimension × sequence length × 2 (for K and V) × 2 bytes (for 16-bit precision). Run the numbers for a model with 80 layers, 64 heads, 128-dimensional head embeddings, and a 100K-token context, and you get roughly 260GB just for the cache. Far more than fits on a single 80GB H100.
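The formula translates directly into a few lines of Python. This is just the arithmetic from the paragraph above; the 80-layer configuration is the illustrative example from the text, not any specific model's published dimensions.

```python
def kv_cache_bytes(layers, batch, heads, head_dim, seq_len, bytes_per_elem=2):
    """KV cache size: layers x batch x heads x head_dim x seq_len
    x 2 (K and V) x bytes per element (2 for 16-bit precision)."""
    return layers * batch * heads * head_dim * seq_len * 2 * bytes_per_elem

# The example from the text: 80 layers, 64 heads, 128-dim heads, 100K context.
size = kv_cache_bytes(layers=80, batch=1, heads=64, head_dim=128, seq_len=100_000)
print(f"{size / 1e9:.0f} GB")  # prints "262 GB"
```

Note that the cache grows linearly in every factor, which is why both longer contexts and larger batches blow it up so quickly.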

The MQA gambit

Shazeer's 2019 paper proposed a radical simplification. In standard multi-head attention, each of the (say) 32 query heads has its own key head and value head. MQA collapses this entirely: all 32 query heads share a single key-value pair. The result is a 32x reduction in KV cache size.

The tradeoff is quality. Sharing a single KV representation across all query heads means the model has less capacity to encode different types of relationships. The paper reported that quality degradation was minor compared to the speed gains, but "minor" is relative. For frontier-scale models where every fraction of a percent matters, even small quality losses compound.

Falcon 7B uses MQA. Maximum speed, some quality cost.

Then came 2023. Google researchers published the GQA paper, which generalized MQA with a simple insight: instead of all heads sharing one KV pair, group them. With G groups, you have G sets of key-value heads, and each group of query heads shares one of them. GQA with G=1 is MQA. GQA with G=H (where H is the number of heads) is standard multi-head attention. Everything in between is a tradeoff.
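The unifying view is easy to see in code. Below is a minimal NumPy sketch (not any library's actual implementation) of attention where H query heads share G cached KV heads: each query head simply indexes into its group's K and V. Setting `num_groups=1` gives MQA, `num_groups=H` gives standard MHA, and anything in between is GQA.

```python
import numpy as np

def grouped_attention(Q, K, V, num_groups):
    """Attention where H query heads share num_groups KV heads.
    Q: (H, T, d); K, V: (num_groups, T, d).
    num_groups=1 is MQA; num_groups=H is standard MHA."""
    H, T, d = Q.shape
    group_size = H // num_groups               # query heads per KV head
    out = np.empty_like(Q)
    for h in range(H):
        g = h // group_size                    # which shared KV head to use
        scores = Q[h] @ K[g].T / np.sqrt(d)
        scores -= scores.max(axis=-1, keepdims=True)   # numerically stable softmax
        weights = np.exp(scores)
        weights /= weights.sum(axis=-1, keepdims=True)
        out[h] = weights @ V[g]
    return out

rng = np.random.default_rng(0)
H, G, T, d = 32, 8, 16, 64
Q = rng.standard_normal((H, T, d))
K = rng.standard_normal((G, T, d))   # only G KV heads are cached: 4x smaller here
V = rng.standard_normal((G, T, d))
out = grouped_attention(Q, K, V, num_groups=G)
```

The point of the sketch is what gets cached: only the `(G, T, d)` tensors K and V, so the cache shrinks by a factor of H/G while the query side is untouched.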

The industry settled on G=8 as the sweet spot.

Why 8? The empirical results tell the story. On the T5-XXL model, GQA with 8 groups achieved 43.5 ROUGE, compared to 43.8 for standard multi-head attention (MHA) and 43.0 for MQA. You recover almost all the quality while keeping most of the speed. The same paper showed that GQA-8 reduces inference time to roughly one-fifth of MHA while nearly matching its quality.

The second insight was even more practical: you don't have to train from scratch. Existing MHA checkpoints can be converted to GQA with only 5% of the original pretraining compute.
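The conversion the paper describes initializes each group's key and value projection by mean-pooling the per-head projections it replaces, then uptrains briefly. A rough sketch of that pooling step, with toy dimensions chosen for illustration (the weight layout and sizes here are assumptions, not any checkpoint's actual format):

```python
import numpy as np

def pool_kv_heads(W_kv, num_groups):
    """Convert an MHA key (or value) projection toward GQA by mean-pooling
    the per-head projections within each group (the GQA paper's initialization,
    which is then followed by uptraining).
    W_kv: (num_heads, head_dim, model_dim) -> (num_groups, head_dim, model_dim)."""
    num_heads, head_dim, model_dim = W_kv.shape
    grouped = W_kv.reshape(num_groups, num_heads // num_groups, head_dim, model_dim)
    return grouped.mean(axis=1)   # average the heads that will share a KV slot

# Toy example: 64 heads pooled into 8 groups (model_dim shrunk for the demo).
W_k = np.random.default_rng(1).standard_normal((64, 128, 256))
W_k_gqa = pool_kv_heads(W_k, num_groups=8)   # shape (8, 128, 256)
```

Mean-pooling preserves the average behavior of each group's original heads, which is why the model needs only a short uptraining phase to recover quality.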

This "uptraining" approach made adoption trivial. Meta adopted GQA for Llama 2 in July 2023 and retained it for Llama 3.

Long context and the memory wall

The KV cache problem gets worse as context windows grow. A 1 million token context would require cache sizes measured in terabytes under standard MHA. FlashAttention (which we covered previously) solves the computational side by avoiding n×n matrix materialization. GQA solves the memory side by shrinking what needs to be stored. These are complementary optimizations. FlashAttention makes attention compute-efficient; GQA makes the KV cache memory-efficient.
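To see the long-context gap concretely, here is the same cache arithmetic at 1 million tokens, again using the hypothetical 80-layer configuration from earlier rather than any real model's dimensions:

```python
def cache_gb(seq_len, layers=80, kv_heads=64, head_dim=128):
    """KV cache in GB: layers x kv_heads x head_dim x 2 (K,V) x 2 bytes (16-bit)."""
    return layers * kv_heads * head_dim * 2 * 2 * seq_len / 1e9

mha = cache_gb(1_000_000, kv_heads=64)   # full MHA: one KV head per query head
gqa = cache_gb(1_000_000, kv_heads=8)    # GQA-8: 8 shared KV heads
print(f"MHA: {mha:.0f} GB, GQA-8: {gqa:.0f} GB")  # prints "MHA: 2621 GB, GQA-8: 328 GB"
```

Terabytes under MHA, hundreds of gigabytes under GQA-8: still large, but within reach of a multi-GPU serving node.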

Together, they're why you can have a 128K context conversation without the server falling over.

IBM's explanation of GQA puts it clearly: "MHA maximizes accuracy at the cost of memory bandwidth; MQA maximizes speed at the expense of accuracy; GQA balances both." That balance isn't free. GQA-8 still loses something compared to full MHA. But for most applications, the inference cost savings outweigh the marginal quality loss.

Our read: the fact that every major model has converged on GQA suggests the tradeoff is favorable. When Mistral, Meta, and Falcon's successors all make the same architectural choice, it's not a coincidence. The quality loss is small enough that users don't notice; the inference savings are large enough that operators do.

The optimization space isn't fully explored. A December 2025 paper on Mixture of Attention Schemes (MoAS) showed that dynamic per-token routing between MHA, GQA, and MQA can outperform any static choice. Their validation loss of 2.3074 beat static mixtures at 2.3093. The difference is small, but it suggests future models might adaptively select attention schemes rather than committing to one upfront.

For now, GQA with G=8 is the default. It's a rare case where the industry genuinely converged on a best practice rather than fragmenting into competing approaches. If you're building inference infrastructure, assume GQA. If you're training a model from scratch, start with G=8 and only deviate if you have a specific reason.

The KV cache bottleneck isn't going away. Context windows keep growing. GQA is how production models stay viable.

Frequently Asked Questions