Time per token
The time required to generate each individual token during model inference, independent of total output length.
Time per token measures how quickly a language model produces each unit of output (a token, roughly a word fragment) during inference. It isolates raw serving speed from other factors like reasoning effort or output length. A reduction in time per token with unchanged model weights indicates infrastructure-level optimization rather than model simplification.
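As a rough sketch, time per token can be measured by timing the gaps between tokens as a streaming inference API yields them. The helper below assumes a generic token iterator as a stand-in for whatever streaming interface the serving stack exposes; the names are illustrative, not a real client API.

```python
import time

def measure_time_per_token(token_stream):
    """Estimate mean per-token latency (seconds) from a token stream.

    `token_stream` is any iterable that yields tokens as they are
    produced (a hypothetical stand-in for a streaming inference API).
    """
    latencies = []
    last = time.perf_counter()
    for _ in token_stream:
        now = time.perf_counter()
        latencies.append(now - last)  # inter-token gap
        last = now
    if not latencies:
        return None
    # The first gap includes time-to-first-token (prompt prefill),
    # so drop it when possible to isolate pure decode speed.
    decode_latencies = latencies[1:] or latencies
    return sum(decode_latencies) / len(decode_latencies)

# Hypothetical usage with any streaming client:
# tpt = measure_time_per_token(client.generate_stream(prompt))
# print(f"{tpt * 1000:.1f} ms per token")
```

Separating the first gap from the rest reflects the common distinction between time-to-first-token, which is dominated by prompt processing, and the steady-state per-token decode latency that this metric targets.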
Also known as
token latency, per-token latency, token generation speed