Tokenization

The process of converting text into the integer sequences that language models actually compute over; the choice of tokenizer determines API costs and how efficiently different languages are encoded.

Tokenization is the preprocessing step that breaks text into discrete units (tokens) mapped to integer IDs before a language model processes them. Most modern LLMs use Byte Pair Encoding (BPE) variants, in which the vocabulary size and training corpus determine which text patterns compress efficiently. Because tokenizers are typically trained on English-dominant data, equivalent text in non-English languages often requires more tokens, creating cost and quality disparities.
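To make the merge-based compression concrete, here is a minimal byte-level BPE sketch in pure Python. This is a toy illustration under assumed inputs, not any production tokenizer (such as those used by real LLM APIs): the corpus, merge count, and example sentences are all invented. Trained only on a few English sentences, the learned merges compress an English sentence much better than a Spanish sentence of similar meaning, reproducing the cross-language disparity described above.

```python
from collections import Counter

def pair_counts(seqs):
    """Count adjacent token pairs across all sequences."""
    counts = Counter()
    for seq in seqs:
        counts.update(zip(seq, seq[1:]))
    return counts

def merge(seq, pair, new_id):
    """Replace every occurrence of `pair` in `seq` with `new_id`."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out

def train_bpe(corpus, num_merges):
    """Learn merge rules from raw UTF-8 bytes (byte-level BPE)."""
    seqs = [list(text.encode("utf-8")) for text in corpus]
    merges = []     # (pair, new_id), in the order learned
    next_id = 256   # ids 0-255 are reserved for raw bytes
    for _ in range(num_merges):
        counts = pair_counts(seqs)
        if not counts or max(counts.values()) < 2:
            break  # no pair repeats; further merges are useless
        pair = counts.most_common(1)[0][0]
        merges.append((pair, next_id))
        seqs = [merge(s, pair, next_id) for s in seqs]
        next_id += 1
    return merges

def encode(text, merges):
    """Tokenize text by applying merges in the order they were learned."""
    seq = list(text.encode("utf-8"))
    for pair, new_id in merges:
        seq = merge(seq, pair, new_id)
    return seq

# Tiny English-only training corpus (a stand-in for English-dominant data).
corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat and the dog sat on the mat",
]
merges = train_bpe(corpus, num_merges=40)

english = "the cat sat on the log"
spanish = "el gato se sentó en el tronco"  # similar meaning, unseen language
en_ratio = len(encode(english, merges)) / len(english.encode("utf-8"))
es_ratio = len(encode(spanish, merges)) / len(spanish.encode("utf-8"))
print(f"English: {en_ratio:.2f} tokens/byte, Spanish: {es_ratio:.2f} tokens/byte")
```

The English sentence, matching patterns the merges were trained on, encodes into far fewer tokens per byte than the Spanish one, which mostly falls back to raw bytes (note that the accented "ó" alone occupies two UTF-8 bytes). The same mechanism, at the scale of a real vocabulary, is what makes non-English API usage more expensive.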

Also known as

tokenizer, text tokenization, LLM tokenization