PagedAttention
A memory management technique for LLM serving that handles KV cache allocation like an operating system manages virtual memory, eliminating fragmentation and improving GPU utilization.
PagedAttention treats KV cache memory the way operating systems handle virtual memory: allocating fixed-size blocks on demand rather than reserving a contiguous region for the maximum possible sequence length. This eliminates the fragmentation that arises when pre-reserved regions for sequences of varying lengths complete and free at different times, allowing more concurrent requests to share GPU memory. PagedAttention is the core innovation behind vLLM and has become a standard approach in production LLM serving systems.
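The allocation scheme can be sketched as a small block allocator plus a per-sequence block table mapping logical block indices to physical blocks. This is a minimal illustration under assumed names (`BlockAllocator`, `Sequence`, `BLOCK_SIZE`), not vLLM's actual API:

```python
BLOCK_SIZE = 16  # token slots per KV cache block (illustrative value)


class BlockAllocator:
    """Pool of fixed-size physical blocks, handed out on demand."""

    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))

    def allocate(self) -> int:
        if not self.free_blocks:
            raise MemoryError("out of KV cache blocks")
        return self.free_blocks.pop()

    def free(self, block: int) -> None:
        self.free_blocks.append(block)


class Sequence:
    """Tracks one request's logical-to-physical block table."""

    def __init__(self, allocator: BlockAllocator):
        self.allocator = allocator
        self.block_table: list[int] = []  # logical index -> physical block id
        self.num_tokens = 0

    def append_token(self) -> None:
        # Allocate a new physical block only when the last one is full,
        # so memory grows with actual sequence length, not a reserved maximum.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.allocate())
        self.num_tokens += 1

    def release(self) -> None:
        # Return all blocks to the pool when the sequence finishes;
        # freed blocks are immediately reusable by other requests.
        for block in self.block_table:
            self.allocator.free(block)
        self.block_table.clear()
        self.num_tokens = 0
```

A sequence of 20 tokens occupies only two 16-slot blocks, and the blocks need not be contiguous, so completed sequences leave no unusable gaps behind.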
Also known as
paged attention, vLLM attention