KV Cache
A memory structure that stores computed key and value tensors from previous tokens during LLM inference, eliminating redundant computation in autoregressive generation.
The KV cache stores the key and value tensors from the attention mechanism for every previously processed token at every layer of the model. Without it, generating each new token would require recomputing keys and values for the entire prefix, making each decoding step O(n²) in the sequence length. With caching, only the new token's key and value are computed and the new query attends over the stored tensors, so the per-step cost drops to O(n), dramatically speeding up generation. However, the cache grows linearly with context length (and batch size) and consumes significant GPU memory, making it the central constraint in LLM serving economics.
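The mechanism can be illustrated with a minimal single-head sketch in NumPy (the projection matrices `Wq`, `Wk`, `Wv` and the toy dimensions are illustrative, not from any real model): incremental decoding appends one key/value row per token and produces the same attention outputs as recomputing K and V for the full prefix at every step.

```python
import numpy as np

def attention(q, K, V):
    # Scaled dot-product attention for a single query vector.
    scores = q @ K.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V

rng = np.random.default_rng(0)
d = 8
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
tokens = rng.normal(size=(5, d))  # embeddings of 5 tokens

# Incremental decoding with a KV cache: one new K/V row per step.
K_cache = np.empty((0, d))
V_cache = np.empty((0, d))
cached_outputs = []
for x in tokens:
    K_cache = np.vstack([K_cache, x @ Wk])  # only the new token is projected
    V_cache = np.vstack([V_cache, x @ Wv])
    cached_outputs.append(attention(x @ Wq, K_cache, V_cache))

# Reference: recompute K and V for the whole prefix at every step.
full_outputs = []
for t in range(1, len(tokens) + 1):
    K = tokens[:t] @ Wk
    V = tokens[:t] @ Wv
    full_outputs.append(attention(tokens[t - 1] @ Wq, K, V))

assert np.allclose(cached_outputs, full_outputs)
```

Both loops yield identical outputs; the cached version simply avoids re-projecting the prefix, which is where the savings come from in real serving stacks.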
Also known as
key-value cache, KV-cache, attention cache