LLM Inference
The process of generating outputs from a trained large language model, consisting of prefill (processing the input) and decode (generating tokens one at a time) phases.
LLM inference is the production workload of serving model responses to users, as distinct from training. It comprises two phases: prefill, which processes all input tokens in parallel and is compute-bound, and decode, which generates output tokens one at a time and is memory-bandwidth-bound, because each step must re-read the model weights and the cached attention state. Inference typically achieves roughly 10% hardware utilization, versus around 70% for training, primarily because the decode phase leaves GPU compute idle while waiting on memory operations.
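The two-phase structure can be sketched in a toy example. The sketch below is illustrative only: the "model" is a stand-in hash function, not a real transformer, and all names (`toy_embed`, `kv_cache`, `next_token`) are hypothetical. What it shows is the shape of the workload: prefill builds the cache from all prompt tokens in one pass, while decode produces one token per step, re-reading the whole cache each time.

```python
def toy_embed(token: int) -> float:
    # Stand-in for the per-token work a real model would do.
    return (token * 2654435761 % 1000) / 1000.0

def next_token(context_sum: float) -> int:
    # Stand-in for deterministic "sampling" from the context.
    return int(context_sum * 1000) % 50

def generate(prompt: list[int], n_new: int) -> list[int]:
    # Prefill: every prompt token is processed in one pass.
    # On real hardware this is parallelizable and compute-bound.
    kv_cache = [toy_embed(t) for t in prompt]

    # Decode: strictly one token per step; each step reads the
    # entire cache, which is why this phase is bound by memory
    # bandwidth rather than compute.
    out = []
    for _ in range(n_new):
        tok = next_token(sum(kv_cache))
        out.append(tok)
        kv_cache.append(toy_embed(tok))  # cache grows by one per step
    return out
```

The sequential dependency in the decode loop (each token depends on the one before it) is exactly what prevents the parallelism that makes prefill, and training, efficient.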
Also known as
inference, model inference, LLM serving