Decode
The second phase of LLM inference where output tokens are generated one at a time, each requiring a full forward pass that is bottlenecked by memory bandwidth rather than compute.
Decode is the autoregressive generation phase: each new token depends on all previous tokens, so generation cannot be parallelized across output positions. Each step performs matrix-vector operations that severely underutilize GPU compute capacity, leaving the arithmetic units mostly idle while waiting on memory reads. The decode phase is memory-bandwidth-bound: the bottleneck is how fast the model weights and KV cache can be streamed from memory to the processing cores, not how fast arithmetic runs. This characteristic explains why hardware utilization during inference is so much lower than during training.
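The memory-bound nature of decode can be seen with a back-of-the-envelope arithmetic-intensity calculation. The sketch below compares a single-token matrix-vector product against a batched (prefill-style) matrix-matrix product; the GPU figures are illustrative, loosely based on an A100-class accelerator, not tied to any specific deployment.

```python
# Back-of-the-envelope check that decode is memory-bandwidth-bound.
# Hardware numbers are illustrative (roughly A100-class), for intuition only.

def arithmetic_intensity(flops, bytes_moved):
    """FLOPs performed per byte moved between memory and compute."""
    return flops / bytes_moved

# One decode step through a weight matrix of shape (d, d):
# a matrix-vector product does ~2*d*d FLOPs but must stream
# all d*d weights (2 bytes each in fp16) from memory.
d = 4096
matvec_flops = 2 * d * d
matvec_bytes = 2 * d * d          # fp16 weight traffic dominates
decode_ai = arithmetic_intensity(matvec_flops, matvec_bytes)

# Processing a batch of B tokens at once (as in prefill) reuses the
# same weight traffic for B times the FLOPs.
B = 512
prefill_ai = arithmetic_intensity(B * matvec_flops, matvec_bytes)

# Hardware "ridge point": FLOPs per byte needed to keep compute busy.
peak_flops = 312e12               # illustrative fp16 peak FLOP/s
mem_bandwidth = 2.0e12            # illustrative HBM bytes/s
ridge = peak_flops / mem_bandwidth

print(f"decode intensity:  {decode_ai:.1f} FLOPs/byte")
print(f"prefill intensity: {prefill_ai:.1f} FLOPs/byte")
print(f"hardware ridge:    {ridge:.1f} FLOPs/byte")
```

Decode lands at roughly 1 FLOP per byte, far below the ~150 FLOPs/byte the hardware needs to stay compute-bound, while the batched case sits well above it; this is why decode throughput tracks memory bandwidth rather than peak FLOPS.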
Also known as
decode phase, autoregressive decoding, token generation