Tokenization

The process of converting text into the integer sequences that language models actually compute over; the choice of tokenizer determines API costs and how efficiently different languages are encoded.

Tokenization is the preprocessing step that breaks text into discrete units (tokens) mapped to integer IDs before a language model processes them. Most modern LLMs use Byte Pair Encoding (BPE) variants, in which the vocabulary size and training corpus determine which text patterns compress efficiently. Because tokenizers are typically trained on English-dominant data, equivalent text in non-English languages often requires more tokens, creating cost and quality disparities.
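To make the merge-based compression concrete, here is a minimal byte-level BPE sketch in pure Python. This is a toy illustration under assumed inputs, not any production tokenizer (such as those used by real LLM APIs): the corpus, merge count, and example sentences are all invented. Trained only on a few English sentences, the learned merges compress an English sentence much better than a Spanish sentence of similar meaning, reproducing the cross-language disparity described above.

```python
from collections import Counter

def pair_counts(seqs):
    """Count adjacent token pairs across all sequences."""
    counts = Counter()
    for seq in seqs:
        counts.update(zip(seq, seq[1:]))
    return counts

def merge(seq, pair, new_id):
    """Replace every occurrence of `pair` in `seq` with `new_id`."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out

def train_bpe(corpus, num_merges):
    """Learn merge rules from raw UTF-8 bytes (byte-level BPE)."""
    seqs = [list(text.encode("utf-8")) for text in corpus]
    merges = []     # (pair, new_id), in the order learned
    next_id = 256   # ids 0-255 are reserved for raw bytes
    for _ in range(num_merges):
        counts = pair_counts(seqs)
        if not counts or max(counts.values()) < 2:
            break  # no pair repeats; further merges are useless
        pair = counts.most_common(1)[0][0]
        merges.append((pair, next_id))
        seqs = [merge(s, pair, next_id) for s in seqs]
        next_id += 1
    return merges

def encode(text, merges):
    """Tokenize text by applying merges in the order they were learned."""
    seq = list(text.encode("utf-8"))
    for pair, new_id in merges:
        seq = merge(seq, pair, new_id)
    return seq

# Tiny English-only training corpus (a stand-in for English-dominant data).
corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat and the dog sat on the mat",
]
merges = train_bpe(corpus, num_merges=40)

english = "the cat sat on the log"
spanish = "el gato se sentó en el tronco"  # similar meaning, unseen language
en_ratio = len(encode(english, merges)) / len(english.encode("utf-8"))
es_ratio = len(encode(spanish, merges)) / len(spanish.encode("utf-8"))
print(f"English: {en_ratio:.2f} tokens/byte, Spanish: {es_ratio:.2f} tokens/byte")
```

The English sentence, matching patterns the merges were trained on, encodes into far fewer tokens per byte than the Spanish one, which mostly falls back to raw bytes (note that the accented "ó" alone occupies two UTF-8 bytes). The same mechanism, at the scale of a real vocabulary, is what makes non-English API usage more expensive.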

Also known as

tokenizer, text tokenization, LLM tokenization