Continuous Batching

A serving technique that immediately ejects finished sequences from a batch and slots in new requests, rather than waiting for all requests in a batch to complete.

Continuous batching (also called in-flight batching) manages GPU batches dynamically: at each generation step, completed sequences are removed and waiting requests are inserted into the freed slots. Traditional static batching holds every request in a batch until the longest one finishes, so the slots of sequences that finished early sit idle and GPU cycles are wasted. Continuous batching improves utilization by refilling those slots right away, which keeps the GPU consistently fed with work. This technique is a standard feature in modern LLM serving frameworks like vLLM and TensorRT-LLM.
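The difference can be illustrated with a toy simulation. The sketch below is not how vLLM or TensorRT-LLM implement scheduling; it is a minimal model under stated assumptions: every request's output length is known in advance (real servers do not know this), each step produces one token per running sequence, and prefill cost and memory limits are ignored. The function names `static_batching_steps` and `continuous_batching_steps` are hypothetical, chosen for illustration only.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Decode steps when each batch runs until its longest request finishes."""
    steps = 0
    for start in range(0, len(lengths), batch_size):
        # The whole batch is held until its longest sequence completes.
        steps += max(lengths[start:start + batch_size])
    return steps


def continuous_batching_steps(lengths, batch_size):
    """Decode steps when freed slots are refilled at every iteration."""
    waiting = deque(lengths)  # tokens still to generate, per queued request
    running = []              # tokens still to generate, per in-flight sequence
    steps = 0
    while waiting or running:
        # Slot in new requests wherever a batch slot is free.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        # One decode iteration: every running sequence emits one token.
        running = [remaining - 1 for remaining in running]
        steps += 1
        # Eject sequences that have just finished.
        running = [remaining for remaining in running if remaining > 0]
    return steps


if __name__ == "__main__":
    # Assumed workload: output lengths in tokens, all positive.
    lengths = [2, 8, 3, 1, 5, 1]
    print("static:    ", static_batching_steps(lengths, batch_size=2))
    print("continuous:", continuous_batching_steps(lengths, batch_size=2))
```

With this assumed workload and a batch size of 2, the static schedule needs 16 decode steps and the continuous schedule needs 11, because short requests no longer wait for the 8-token request to finish. The gap depends heavily on how much output lengths vary; with uniform lengths the two schedules take the same number of steps.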

Also known as