Continuous Batching
A serving technique that adds requests to a running GPU batch as they arrive and removes them as they finish, rather than waiting for the whole batch to complete.
Continuous batching (also called in-flight batching) is an inference serving optimization that makes scheduling decisions at each generation step: as soon as a sequence finishes, it is removed from the batch and a waiting request takes its slot, rather than every request in the batch waiting for the longest one to finish. Because output lengths vary from request to request, this keeps batch slots occupied, which raises GPU utilization and throughput in production serving and reduces the time slots sit idle.
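The difference can be illustrated with a toy simulation that counts decode steps under each policy. This is only a sketch under simplifying assumptions: each request needs a fixed, known number of output tokens (at least one), each step produces one token per active request, and prefill cost and memory limits are ignored. The function names and the example lengths are hypothetical and do not come from any particular serving framework.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Static batching: each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(lengths, batch_size):
    """Continuous batching: free slots are refilled before every decode step."""
    waiting = deque(lengths)  # tokens still needed by queued requests
    running = []              # tokens still needed by active requests
    steps = 0
    while waiting or running:
        # Admit waiting requests into any free batch slots.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        # One decode step: every active request generates one token.
        running = [remaining - 1 for remaining in running]
        # Eject requests that have just finished.
        running = [remaining for remaining in running if remaining > 0]
        steps += 1
    return steps


# Hypothetical output lengths (in tokens) for eight requests.
lengths = [2, 8, 3, 8, 2, 2, 3, 2]
print(static_batching_steps(lengths, batch_size=4))      # 11
print(continuous_batching_steps(lengths, batch_size=4))  # 8
```

With these example lengths the same 30 tokens take 11 steps under static batching and 8 under continuous batching, because slots freed by short requests are reused immediately instead of idling until the longest request in the batch completes.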
Also known as
in-flight batching, dynamic batching