Time per token
The time required to generate each individual token during model inference, independent of total output length.
Time per token measures how quickly a language model produces each unit of output (a token, roughly a word fragment) during inference. It isolates raw serving speed from other factors like reasoning effort or output length. A reduction in time per token with unchanged model weights indicates infrastructure-level optimization rather than model simplification.
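As a rough sketch, time per token can be measured by timing the gaps between tokens as a streaming inference API yields them. The helper below assumes a generic token iterator as a stand-in for whatever streaming interface the serving stack exposes; the names are illustrative, not a real client API.

```python
import time

def measure_time_per_token(token_stream):
    """Estimate mean per-token latency (seconds) from a token stream.

    `token_stream` is any iterable that yields tokens as they are
    produced (a hypothetical stand-in for a streaming inference API).
    """
    latencies = []
    last = time.perf_counter()
    for _ in token_stream:
        now = time.perf_counter()
        latencies.append(now - last)  # inter-token gap
        last = now
    if not latencies:
        return None
    # The first gap includes time-to-first-token (prompt prefill),
    # so drop it when possible to isolate pure decode speed.
    decode_latencies = latencies[1:] or latencies
    return sum(decode_latencies) / len(decode_latencies)

# Hypothetical usage with any streaming client:
# tpt = measure_time_per_token(client.generate_stream(prompt))
# print(f"{tpt * 1000:.1f} ms per token")
```

Separating the first gap from the rest reflects the common distinction between time-to-first-token, which is dominated by prompt processing, and the steady-state per-token decode latency that this metric targets.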
Also known as
token latency, per-token latency, token generation speed