Vocabulary size
The total number of unique tokens a tokenizer can represent, typically ranging from 32K to 128K in modern LLMs.
Vocabulary size is a design parameter that determines how many distinct tokens a tokenizer's lookup table contains. Larger vocabularies compress common text into fewer tokens (lowering the per-request inference cost) but require larger embedding tables that consume more GPU memory. GPT-2 used a vocabulary of roughly 50K tokens, GPT-4's tokenizer has about 100K, and Llama 3's has 128K. The choice of vocabulary size therefore reflects an engineering tradeoff between inference efficiency and model size.
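The tradeoff can be made concrete with a minimal sketch. The snippet below uses the tiktoken library's "gpt2" (~50K vocab) and "cl100k_base" (~100K vocab, the GPT-4 tokenizer) encodings to count tokens for the same sentence, and estimates the corresponding embedding-table size; the hidden dimension of 4096 and fp16 storage are illustrative assumptions, not any specific model's configuration.

    # Sketch of the vocabulary-size tradeoff: compression vs. embedding memory.
    import tiktoken

    text = "Tokenization splits text into subword units drawn from a fixed vocabulary."
    hidden_dim = 4096  # illustrative hidden size, not tied to a specific model

    for name in ("gpt2", "cl100k_base"):
        enc = tiktoken.get_encoding(name)
        n_tokens = len(enc.encode(text))          # larger vocab -> usually fewer tokens
        embed_params = enc.n_vocab * hidden_dim   # embedding table = vocab_size x hidden_dim
        embed_mb = embed_params * 2 / 1024**2     # assuming 2 bytes per parameter (fp16)
        print(f"{name}: vocab={enc.n_vocab:,}, tokens for sample={n_tokens}, "
              f"embedding table ~{embed_params / 1e6:.0f}M params (~{embed_mb:.0f} MB fp16)")

Running this shows the larger vocabulary encoding the sample in fewer tokens while its embedding table takes roughly twice the memory, which is the tradeoff described above.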
Also known as
vocab size