Continuous Batching

A serving technique that adds requests to a GPU batch as they arrive and removes them as they finish, rather than waiting for every request in the batch to complete.

Continuous batching (also called in-flight batching) is an inference serving optimization that ejects completed sequences from a batch as soon as they finish and inserts waiting requests in their place, rather than making every request in the batch wait for the longest one to finish. Because output lengths vary from request to request, a fixed batch leaves slots idle once its shorter sequences are done. Refilling those slots at each decoding step keeps the batch full, which raises GPU utilization and throughput in production serving.
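
The difference can be illustrated with a toy simulation that counts decoding steps under each policy. This is a minimal sketch, not any serving framework's scheduler: the function names and the request lengths are hypothetical, every request is assumed to be queued at time zero, each step is assumed to produce one token per active request, and prefill cost and memory limits are ignored.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Decode steps when each batch runs until its longest request finishes."""
    steps = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        # Every slot is held until the longest request in the batch is done.
        steps += max(batch)
    return steps


def continuous_batching_steps(lengths, batch_size):
    """Decode steps when a freed slot is refilled before the next step."""
    waiting = deque(lengths)  # tokens still to generate, per queued request
    running = []              # tokens still to generate, per active request
    steps = 0
    while waiting or running:
        # Insert waiting requests into any free slots.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        # One decoding step: each active request emits one token.
        running = [remaining - 1 for remaining in running]
        steps += 1
        # Eject requests that have just finished.
        running = [remaining for remaining in running if remaining > 0]
    return steps


# Hypothetical output lengths (in tokens) for eight queued requests.
lengths = [2, 8, 3, 8, 2, 2, 3, 3]
print(static_batching_steps(lengths, batch_size=4))      # 11
print(continuous_batching_steps(lengths, batch_size=4))  # 8
```

With these assumed lengths, the fixed batches take 8 + 3 = 11 steps, while the continuous policy finishes in 8, the length of the longest single request. Real schedulers also have to account for prefill work and KV-cache memory when admitting a request, which this sketch leaves out.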

Also known as

in-flight batching, dynamic batching