Decode

The second phase of LLM inference where output tokens are generated one at a time, bottlenecked by memory bandwidth rather than compute.

Decode is the autoregressive generation phase of LLM inference, following prefill, in which output tokens are produced one at a time, each step conditioned on all previously generated tokens. Because each step processes only a single new token per sequence, the work reduces largely to matrix-vector operations, which underutilize GPU compute capacity and leave the phase memory-bandwidth-bound. The bottleneck is loading model weights and KV cache data from memory, not performing arithmetic. This characteristic drives much of inference optimization research.
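
The shape of the decode loop is easiest to see in code. Below is a minimal greedy-decoding sketch using a Hugging Face transformers-style causal LM; the model name ("gpt2"), the 32-token generation length, and greedy argmax selection are illustrative choices, not part of the definition above. The key point is that prefill processes the whole prompt in one pass, while each decode step feeds in a single token and reuses the KV cache.

```python
# Minimal prefill + decode sketch (greedy), assuming the Hugging Face
# transformers causal-LM interface; model choice is illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # any causal LM follows the same two-phase pattern
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

prompt_ids = tokenizer("The decode phase", return_tensors="pt").input_ids

with torch.no_grad():
    # Prefill: process the entire prompt at once (compute-heavy).
    out = model(prompt_ids, use_cache=True)
    past = out.past_key_values
    next_id = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
    generated = [next_id]

    # Decode: one token per step. Each step feeds a single token and
    # reuses the KV cache, so the hardware mostly streams weights and
    # cached keys/values from memory (memory-bandwidth-bound).
    for _ in range(32):
        out = model(next_id, past_key_values=past, use_cache=True)
        past = out.past_key_values
        next_id = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
        generated.append(next_id)

print(tokenizer.decode(torch.cat(generated, dim=-1)[0]))
```

Because every iteration of this loop touches every weight matrix while computing only one token's worth of activations, throughput is limited by how quickly weights and cached keys/values can be read from memory rather than by arithmetic.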

Also known as

autoregressive decoding, token generation