Every time you send a prompt to an LLM and watch tokens stream back, your request triggers hundreds of sequential neural network passes on a GPU that costs tens of thousands of dollars. Training gets the headlines, but inference is where the money goes: it runs 24/7, serves millions of users, and operates at roughly 10% hardware efficiency compared to training's 70%.
That efficiency gap is the story. Not expensive hardware, but badly utilized hardware.
Prefill vs. Decode: One GPU, Two Jobs
LLM inference isn't one operation. It's two fundamentally different workloads bolted together, and they have almost nothing in common.
Prefill happens when your prompt arrives. The model processes all your input tokens at once as large matrix-matrix multiplications, the kind of work GPUs were built for. It's highly parallel and saturates the GPU's compute units. A 1,000-token prompt? The GPU crunches all 1,000 tokens simultaneously.
Decode is the opposite. Generating the response means producing one token at a time, with each new token requiring a full forward pass through the model. The output of step 47 depends on step 46, so there's no parallelizing your way out. This phase is dominated by matrix-vector operations that underutilize the GPU's compute capacity. The GPU sits mostly idle, waiting on data.
Prefill is compute-bound: throw more math units at it and it gets faster. Decode is memory-bandwidth-bound: the bottleneck is how fast you can shuttle data from memory to the processing cores, not how fast the math runs.
Faster arithmetic doesn't help when the GPU is stalled on a memory read.
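To make the contrast concrete, here's a toy NumPy sketch (illustrative dimensions, not a real transformer layer): the same math done prefill-style as one matrix-matrix product, then decode-style as 1,000 dependent matrix-vector products. The results are identical; the memory-traffic-per-FLOP profiles are not, because the sequential version reloads the same weights on every step.

```python
import numpy as np

d = 512                                          # toy hidden dimension
W = np.random.randn(d, d).astype(np.float32)     # one weight matrix
X = np.random.randn(1000, d).astype(np.float32)  # 1,000 "tokens"

# Prefill-style: one matrix-matrix multiply over all tokens at once.
prefill_out = X @ W

# Decode-style: the same math as 1,000 sequential matrix-vector multiplies.
decode_out = np.empty_like(prefill_out)
for i in range(1000):
    decode_out[i] = X[i] @ W   # each step reloads all of W from memory

# Same answer, wildly different hardware efficiency.
assert np.allclose(prefill_out, decode_out, rtol=1e-3, atol=1e-3)
```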
The JAX Scaling Book puts concrete numbers on this. Attention during decode has an arithmetic intensity of about 1, meaning one floating-point operation per byte loaded. Their verdict: "we're basically always memory bandwidth-bound during attention." No algorithmic trick changes that ratio.
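That intensity-of-1 figure is easy to reproduce from first principles. A back-of-envelope sketch, assuming a bf16 cache (2 bytes per value) and illustrative sizes: computing attention scores does roughly one multiply-add per cached key element loaded.

```python
# Arithmetic intensity of the q @ K^T score computation during decode,
# assuming a bf16 KV cache (2 bytes per value). Sizes are illustrative.
seq_len, head_dim = 8192, 128

flops = 2 * seq_len * head_dim         # one multiply-add = 2 FLOPs per element
bytes_loaded = 2 * seq_len * head_dim  # each bf16 key element is 2 bytes

intensity = flops / bytes_loaded
print(intensity)   # -> 1.0: one FLOP per byte loaded, firmly memory-bound
```

Changing the sequence length or head dimension cancels out of the ratio, which is why no amount of scale escapes the bound.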
The KV Cache
In attention, every token needs to "attend to" all previous tokens. Without caching, generating the 100th token means recomputing keys and values for all 99 tokens before it. Generating the 101st means recomputing all 100. The cumulative work scales quadratically: O(n²).
The KV cache fixes this by storing the key and value tensors from every previous token at every layer. Generate token 101 and you only compute the new token's keys and values, then look up everything else from cache. This drops complexity to O(n). Sebastian Raschka measured a roughly 5x speedup with KV caching enabled on a 124M parameter model generating 200 tokens. Without it, autoregressive generation would be impractically slow.
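The mechanics can be sketched in a few lines (toy single-head attention with made-up names, not any real model's code): each decode step computes keys and values for one new token and appends them to the cache, rather than recomputing them for the entire prefix.

```python
import numpy as np

d = 64
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

def attend(q, K, V):
    """Single-head attention of one query against the cached keys/values."""
    scores = q @ K.T / np.sqrt(d)
    weights = np.exp(scores - scores.max())
    return (weights / weights.sum()) @ V

# Decode loop with a KV cache: each step computes K/V for ONE token and
# appends them -- O(n) total work instead of O(n^2) without the cache.
K_cache, V_cache = np.empty((0, d)), np.empty((0, d))
for step in range(200):
    x = rng.standard_normal(d)   # stand-in for the newest token's hidden state
    K_cache = np.vstack([K_cache, x @ Wk])
    V_cache = np.vstack([V_cache, x @ Wv])
    out = attend(x @ Wq, K_cache, V_cache)   # feeds the next step in a real model

print(K_cache.shape)   # -> (200, 64): one cached row per generated token
```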
The catch is memory. The KV cache stores tensors for every token, at every layer, for every attention head. NVIDIA gives the per-token formula:
2 × num_layers × num_heads × dim_head × precision_bytes
For Llama 2 7B at 4K context with 16-bit precision, that's about 2GB just for the cache of a single sequence. Scale to a 13B model with 8K context and the numbers get uncomfortable: the JAX Scaling Book reports 6.7GB per sequence for Llama 2-13B.
Just four concurrent sequences at that length would exceed the memory footprint of the model parameters themselves (26GB).
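Both figures fall out of NVIDIA's per-token formula directly. A quick sketch, using the published Llama 2 configs (32 layers/32 heads for 7B, 40 layers/40 heads for 13B, head dimension 128; treat these as assumptions) and decimal gigabytes:

```python
def kv_cache_gb(num_layers, num_heads, dim_head, seq_len, precision_bytes=2):
    """Per-sequence KV cache size from NVIDIA's per-token formula.
    The leading 2 covers keys AND values; precision_bytes=2 is fp16/bf16."""
    per_token = 2 * num_layers * num_heads * dim_head * precision_bytes
    return per_token * seq_len / 1e9   # decimal GB

# Llama 2 7B (32 layers, 32 heads, head_dim 128) at 4K context:
print(round(kv_cache_gb(32, 32, 128, 4096), 1))   # -> 2.1 GB, "about 2GB"

# Llama 2 13B (40 layers, 40 heads, head_dim 128) at 8K context:
print(round(kv_cache_gb(40, 40, 128, 8192), 1))   # -> 6.7 GB, matching the book
```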
This is why longer contexts cost more. It's not that the math gets harder; the KV cache grows linearly with sequence length, and loading all of it from memory dominates decode time. MatX's research quantifies this starkly: for an 8B model at batch size 128 with 8K context on an H100, loading the KV cache costs the equivalent of about 41 TFLOPs of compute time, while the forward-pass math itself is about 2 TFLOPs.
The GPU spends 20x more effort moving cached data than doing useful math.
When providers quote inference speed, the number can mean very different things depending on which phase they're measuring. Time to first token (TTFT) measures prefill latency: how long you wait before the first response token appears. It's driven by the compute-bound prefill phase and scales with input length. Time per output token (TPOT) measures decode speed: how fast tokens stream after that first one. It's driven by the memory-bandwidth-bound decode phase and depends heavily on model size, context length (because of KV cache loading), and how many other requests share the GPU.
Then there's throughput: total tokens per second across all concurrent requests. The JAX Scaling Book's benchmarks show how dramatically this varies with batch size. Llama 2-13B on 8x TPU v5e produces about 200 tokens/sec at batch 1, roughly 873 tokens/sec at batch 32, and about 964 tokens/sec at batch 240. Throughput more than quadruples from batch 1 to 32, then barely budges; the system hits a ceiling where it becomes compute-bound, at a critical batch size of around 240 tokens for bf16 on TPU v5e (about 280 on an H100).
Most real-world serving operates well below that critical batch size, firmly in memory-bandwidth-bound territory.
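The critical batch size is just the hardware's own FLOPs-to-bandwidth ratio in disguise: the batch at which matmul time catches up with weight-loading time. A rough sketch using public datasheet peaks (these spec numbers are my assumptions, not from the benchmarks above, which is why the H100 figure lands near but not exactly on the ~280 quoted):

```python
# Break-even batch size for bf16 weights: compute time 2*B*P/peak_flops
# equals load time bytes_per_param*P/bandwidth when
# B = bytes_per_param * peak_flops / (2 * bandwidth).
def critical_batch(peak_flops, mem_bandwidth_bytes, bytes_per_param=2):
    return peak_flops * bytes_per_param / (2 * mem_bandwidth_bytes)

# Datasheet peaks (assumptions): TPU v5e ~197 bf16 TFLOP/s, ~819 GB/s;
# H100 SXM ~989 bf16 TFLOP/s, ~3.35 TB/s.
print(round(critical_batch(197e12, 819e9)))    # -> 241, matching the ~240 figure
print(round(critical_batch(989e12, 3.35e12)))  # -> 295, ballpark of the ~280
```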
Batching and Its Tradeoffs
Serving systems try to pack multiple requests onto one GPU simultaneously. The logic is straightforward: if the GPU is memory-bandwidth-bound anyway, you might as well load the model weights once and use them for multiple requests in the same pass.
But each request needs its own KV cache allocation.
GPU memory is fixed, so more concurrent requests means more memory devoted to caches, which in turn means either shorter maximum context lengths or fewer requests in flight. Long contexts, high concurrency, a GPU you can afford: pick two.
Modern serving frameworks attack this from multiple angles. Continuous batching (also called in-flight batching) immediately ejects finished sequences from a batch and slots in new ones, rather than making all requests wait for the longest one to finish. This alone dramatically improves GPU utilization in production. Architectural changes help too. Grouped-query attention (GQA) shrinks the KV cache by sharing key-value heads across query heads. The Scaling Book shows that a 5x reduction in KV cache size lets a system serve batch 64 at 3,757 tokens/sec, compared to 923 tokens/sec without it. That's not a minor gain; it's the difference between profitable and unprofitable serving.
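The continuous-batching idea fits in a toy loop (a hypothetical scheduler for illustration, not any framework's actual API): a slot is refilled from the queue the moment its sequence finishes, instead of holding the whole batch until the longest request completes.

```python
from collections import deque

def continuous_batching(requests, batch_slots=4):
    """Toy decode loop. Each request is (name, tokens_to_generate); a slot
    is refilled from the queue as soon as its sequence finishes."""
    queue = deque(requests)
    active = {}                 # slot index -> [name, tokens_remaining]
    completed, steps = [], 0

    while queue or active:
        # Refill any free slots immediately: the "continuous" part.
        for slot in range(batch_slots):
            if slot not in active and queue:
                name, n = queue.popleft()
                active[slot] = [name, n]
        # One decode step generates one token for every active sequence.
        steps += 1
        for slot in list(active):
            active[slot][1] -= 1
            if active[slot][1] == 0:
                completed.append(active.pop(slot)[0])
    return completed, steps

done, steps = continuous_batching(
    [("a", 2), ("b", 8), ("c", 3), ("d", 1), ("e", 4)], batch_slots=2)
print(steps)   # -> 10 decode steps; static batches of 2 would need 15 (8+3+4)
```

The gap widens as request lengths get more uneven, which is exactly the production case.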
Our read: the most important thing to understand about inference cost is that the hardware is being used badly, not that the hardware is expensive.
MatX's analysis puts the utilization gap at roughly 70% during training versus 10% during inference. Training processes massive batches of data with predictable, parallelizable workloads. Inference processes unpredictable requests, one token at a time, with each request demanding its own chunk of memory for the KV cache. The GPU's enormous compute capacity sits largely idle while it waits on memory reads.
Every major inference optimization maps directly to this diagnosis. Quantization (INT8, INT4) shrinks the model weights and KV cache, meaning less data to load per step. GQA reduces the number of KV heads to cache. PagedAttention manages cache memory like an operating system manages virtual memory, eliminating fragmentation. Disaggregated serving splits prefill and decode onto separate hardware tuned for each workload's profile. All variations on the same theme: make the KV cache smaller, or manage memory access patterns more efficiently.
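The PagedAttention idea can be sketched as a block table (a toy allocator for illustration; vLLM's real implementation differs): KV tensors live in fixed-size physical blocks, and each sequence holds a list of block IDs rather than one contiguous slab, so a finished sequence returns whole blocks to a shared pool with no fragmentation.

```python
class ToyKVBlockManager:
    """Toy PagedAttention-style allocator: fixed-size KV blocks, per-sequence
    block tables, one shared free pool. A sketch, not vLLM's internals."""
    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size
        self.free = list(range(num_blocks))   # physical block IDs
        self.tables = {}                      # seq_id -> list of block IDs
        self.lengths = {}                     # seq_id -> tokens stored

    def append_token(self, seq_id):
        """Reserve cache space for one new token, grabbing a block if needed."""
        n = self.lengths.get(seq_id, 0)
        if n % self.block_size == 0:          # current block full (or none yet)
            self.tables.setdefault(seq_id, []).append(self.free.pop())
        self.lengths[seq_id] = n + 1

    def release(self, seq_id):
        """Sequence finished: all its blocks go straight back to the pool."""
        self.free.extend(self.tables.pop(seq_id))
        del self.lengths[seq_id]

mgr = ToyKVBlockManager(num_blocks=8, block_size=16)
for _ in range(40):
    mgr.append_token("req-1")
blocks_used = len(mgr.tables["req-1"])
print(blocks_used)       # -> 3: ceil(40 / 16) blocks for 40 tokens
mgr.release("req-1")
print(len(mgr.free))     # -> 8: every block back in the pool, zero fragmentation
```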
Inference-Native Hardware Is Coming
The field is converging on a recognition that inference hardware should look different from training hardware. Training wants maximum FLOPs; inference wants maximum memory bandwidth per dollar. Companies like MatX, Groq, and Cerebras are building chips around this insight, optimizing for the memory-bound profile that defines the decode phase.
Model-level changes are chipping away from the other direction: longer-context architectures with linear attention, mixture-of-experts models that activate fewer parameters per token, aggressive KV cache compression.
The KV cache will remain the central constraint. It's what makes fast autoregressive generation possible, and it's what makes that generation expensive. Anyone building on top of LLMs should understand that context length, concurrency, and cost are fundamentally linked through this single data structure. "Making inference cheaper" is mostly a polite way of saying "managing the KV cache better."