KV Cache
A memory structure that stores computed key and value tensors from previous tokens during LLM inference, eliminating redundant computation in autoregressive generation.
The KV cache stores the key and value tensors from the attention mechanism for every previously processed token at every layer of the model. Without it, generating each new token would require recomputing keys and values for the entire prefix, making each decoding step O(n²) in the sequence length. With caching, only the new token's key and value are computed and the new query attends over the stored tensors, so the per-step cost drops to O(n), dramatically speeding up generation. However, the cache grows linearly with context length (and batch size) and consumes significant GPU memory, making it the central constraint in LLM serving economics.
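The mechanism can be illustrated with a minimal single-head sketch in NumPy (the projection matrices `Wq`, `Wk`, `Wv` and the toy dimensions are illustrative, not from any real model): incremental decoding appends one key/value row per token and produces the same attention outputs as recomputing K and V for the full prefix at every step.

```python
import numpy as np

def attention(q, K, V):
    # Scaled dot-product attention for a single query vector.
    scores = q @ K.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V

rng = np.random.default_rng(0)
d = 8
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
tokens = rng.normal(size=(5, d))  # embeddings of 5 tokens

# Incremental decoding with a KV cache: one new K/V row per step.
K_cache = np.empty((0, d))
V_cache = np.empty((0, d))
cached_outputs = []
for x in tokens:
    K_cache = np.vstack([K_cache, x @ Wk])  # only the new token is projected
    V_cache = np.vstack([V_cache, x @ Wv])
    cached_outputs.append(attention(x @ Wq, K_cache, V_cache))

# Reference: recompute K and V for the whole prefix at every step.
full_outputs = []
for t in range(1, len(tokens) + 1):
    K = tokens[:t] @ Wk
    V = tokens[:t] @ Wv
    full_outputs.append(attention(tokens[t - 1] @ Wq, K, V))

assert np.allclose(cached_outputs, full_outputs)
```

Both loops yield identical outputs; the cached version simply avoids re-projecting the prefix, which is where the savings come from in real serving stacks.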
Also known as
key-value cache, KV-cache, attention cache