Prefill
The first phase of LLM inference where all input tokens are processed simultaneously through matrix-matrix multiplication, saturating GPU compute capacity.
Prefill occurs when a prompt first arrives at an LLM and the model processes all input tokens in parallel. This phase performs matrix-matrix multiplication, the type of highly parallelized work GPUs excel at, and is compute-bound rather than memory-bound. Prefill latency determines time to first token (TTFT) and scales with input length. Some serving architectures disaggregate prefill onto separate hardware optimized for its compute-heavy profile.
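The contrast between prefill's matrix-matrix work and token-at-a-time matrix-vector work can be sketched with NumPy. The shapes and names here are illustrative assumptions, not from any real serving framework:

```python
import numpy as np

# Toy sketch: prefill applies one weight matrix to the whole prompt in a
# single matrix-matrix multiply, whereas decode-style processing would do
# one matrix-vector product per token.
rng = np.random.default_rng(0)
seq_len, d_model = 8, 16  # hypothetical prompt length and hidden size
prompt = rng.standard_normal((seq_len, d_model))  # all input token embeddings
W = rng.standard_normal((d_model, d_model))       # one projection weight

# Prefill: all tokens in parallel -> (seq_len, d_model) @ (d_model, d_model)
prefill_out = prompt @ W

# Sequential alternative: one matrix-vector product per token
seq_out = np.stack([prompt[i] @ W for i in range(seq_len)])

assert np.allclose(prefill_out, seq_out)  # same math, batched into one matmul
```

The results are identical; the difference is that the single large matmul keeps the GPU's compute units saturated, which is why prefill is compute-bound while per-token decode tends to be memory-bound.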
Also known as
prefill phase, prompt processing