Tokenization: Why Your Prompt Costs What It Costs

Tokenization sits between your text and every LLM. It sets API costs, taxes non-English languages, and quietly degrades accuracy for billions of speakers.

Tags: Infrastructure, tokenization, LLM infrastructure, API pricing, multilingual AI

Ask ChatGPT how many Rs are in "strawberry" and there's a decent chance it gets it wrong. Not because the model is bad at counting. Because the model never sees the letters at all.

When GPT-4's tokenizer processes "strawberry," it splits the word into three chunks: [str, aw, berry], mapped to token IDs [496, 675, 15717]. From the model's perspective, there is no word "strawberry." There are three opaque integers. Asking it to count the Rs is like asking someone to count brushstrokes in a Chinese character they can't read. The information simply isn't in the representation.

This is tokenization: the process that converts your text into the integer sequences a model actually computes over. It's the invisible layer that determines how much your API calls cost, why some languages are structurally more expensive than others, and where some of the sharpest fairness problems in modern AI quietly live.

How Byte Pair Encoding Actually Works

Nearly every major LLM uses some variant of Byte Pair Encoding (BPE), a compression algorithm originally developed in 1994, adapted for neural machine translation in 2016, and later adopted by OpenAI for the GPT family. The idea is simple: start with individual characters, then iteratively merge the most frequent adjacent pair until you reach a target vocabulary size.

Hugging Face's walkthrough makes this concrete. Imagine a tiny corpus containing the words "hug" (10 times), "pug" (5 times), "pun" (12 times), "bun" (4 times), and "hugs" (5 times). The base vocabulary is just the individual letters: [b, g, h, n, p, s, u]. The algorithm scans for the most common adjacent pair; that's (u, g), appearing 20 times across "hug," "pug," and "hugs." So it merges them into a new token ug. Then it rescans, finds the next most frequent pair (u, n) at 16 occurrences, merges that into un. Then (h, ug) becomes hug. The process keeps going until the vocabulary hits whatever size the designers chose.
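The merge loop described above is small enough to run directly. Here's a minimal sketch of BPE training on that toy corpus, using only the standard library (illustrative only, not a production tokenizer):

```python
from collections import Counter

def pair_counts(words):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    out = {}
    for symbols, freq in words.items():
        merged, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        out[tuple(merged)] = freq
    return out

# The toy corpus: each word split into characters, mapped to its frequency.
words = {tuple("hug"): 10, tuple("pug"): 5, tuple("pun"): 12,
         tuple("bun"): 4, tuple("hugs"): 5}

merges = []
for _ in range(3):
    best = pair_counts(words).most_common(1)[0][0]
    merges.append(best)
    words = merge(words, best)

print(merges)  # [('u', 'g'), ('u', 'n'), ('h', 'ug')]
```

Running three merge steps reproduces exactly the order from the walkthrough: (u, g), then (u, n), then (h, ug).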

Here's what matters: this process is trained on a specific corpus. Whatever text the tokenizer saw during training determines what gets compressed efficiently. English words that appeared millions of times become single tokens. Rare words, technical jargon, and text in underrepresented languages get shredded into fragments.

GPT-2 added a clever twist called byte-level BPE. Instead of starting from Unicode characters, it operates on raw bytes, giving it a base vocabulary of 256. This means it can represent anything (no input ever produces an "unknown token" error) but it also means the tokenizer treats text as a byte stream, not as human-readable characters.
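The byte-level guarantee is easy to see with Python's built-in UTF-8 encoding. This shows the input space a byte-level tokenizer starts from, not OpenAI's actual merge table:

```python
# Byte-level BPE's base alphabet is the 256 possible byte values, so any
# string -- accented characters, emoji, rare scripts -- decomposes into
# symbols the tokenizer already knows. Nothing is ever "unknown".
text = "naïve 🍓"
raw = text.encode("utf-8")

print(list(raw))            # integers in 0..255, always representable
print(len(text), len(raw))  # 7 characters become 11 bytes
```

The flip side is visible in the lengths: what a human reads as 7 characters, the model receives as 11 byte-level symbols before any merges apply.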

That's why "strawberry" ends up as three seemingly arbitrary chunks instead of staying whole.

Vocabulary Size as Business Strategy

GPT-2 used roughly 50,000 tokens. GPT-4 uses about 100,000. Llama 3 jumped to 128,000. These aren't arbitrary numbers. Larger vocabularies mean common text compresses into fewer tokens, which means cheaper inference per sentence. Meta's own benchmarks show Llama 3's tokenizer produces about 15% fewer tokens than its predecessor for the same text.

That's a direct cost reduction.

But bigger vocabularies also mean bigger embedding tables, which means more GPU memory. Llama 3's 8B parameter model is larger than Llama 2's 7B partly because of the expanded vocabulary. The trade-off is real: compress text better at inference time, or keep the model smaller and easier to serve. When a provider doubles their vocabulary, they're making a bet about what text matters and who their users are.
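The memory side of that trade-off is simple arithmetic. A rough sketch, assuming bf16 parameters and a hidden size of 4096 (which happens to match both Llama 2 7B and Llama 3 8B):

```python
def embedding_gib(vocab_size: int, d_model: int, bytes_per_param: int = 2) -> float:
    """Memory for one embedding table in GiB (bf16 by default).
    Models with untied weights pay this twice: input embeddings
    plus the output projection."""
    return vocab_size * d_model * bytes_per_param / 2**30

print(round(embedding_gib(32_000, 4096), 2))   # ~0.24 GiB at Llama 2's vocab size
print(round(embedding_gib(128_256, 4096), 2))  # ~0.98 GiB at Llama 3's vocab size
```

Quadrupling the vocabulary adds roughly three-quarters of a gibibyte per table before a single transformer layer is counted, which is a meaningful share of an 8B model's footprint.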

One thing most developers don't realize: identical text produces different token counts depending on which provider's tokenizer processes it. OpenAI uses tiktoken. Google and Meta have used SentencePiece (though Llama 3 switched to tiktoken). These tokenizers were trained on different corpora with different merge rules, so the same sentence can vary significantly in token count between them. "Price per million tokens" is not an apples-to-apples comparison across providers. If you're doing serious cost optimization, you need to count tokens with each provider's specific tokenizer, not assume equivalence.
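Here's a sketch of why per-token prices can't be compared directly. The token counts and prices are made up for illustration; in practice you'd count with each provider's own tokenizer:

```python
def request_cost(tokens: int, usd_per_million_tokens: float) -> float:
    """Cost of one request under simple per-token pricing."""
    return tokens * usd_per_million_tokens / 1_000_000

# Hypothetical: the same prompt tokenizes to 120 tokens under provider A's
# tokenizer and 180 under provider B's.
cost_a = request_cost(120, 2.50)  # provider A: pricier per token
cost_b = request_cost(180, 2.00)  # provider B: cheaper per token

print(cost_a, cost_b)  # B is cheaper per token but more expensive per request
```

The provider with the lower sticker price loses on this prompt because its tokenizer produces 50% more tokens. Multiply by millions of requests and the "cheap" option is the expensive one.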

Non-English Speakers Pay More for Worse Results

This is where the consequences get sharp.

Tokenizers trained primarily on English text produce dramatically more tokens when processing other languages. The mechanism is straightforward: if Arabic or Yoruba text barely appeared in the tokenizer's training corpus, those languages never got their common words compressed into efficient single tokens. Every word gets fragmented into more pieces.

Research from Petrov et al. (NeurIPS 2023) documented tokenization length differences of up to 15x between languages for identical translated text. Even byte-level and character-level models showed over 4x differences in encoding length for some language pairs. This is structural, baked into the tokenizer before the model is even invoked.

The Token Tax paper quantified the economic fallout:

Because transformer attention scales quadratically with sequence length, a 2x increase in token count doesn't just double the attention compute; it quadruples it.

For a model like Llama 3.1 405B, that's the difference between roughly $105 million in training costs for English and $420 million for a language with doubled fertility (the average number of tokens produced per word). At inference time, costs and latency both scale directly with token count.
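The back-of-envelope behind those figures, treating the quadratic attention term as dominant (a simplification; real training cost also includes large components that scale linearly with token count):

```python
def scaled_training_cost(base_musd: float, fertility_multiplier: float) -> float:
    """If attention compute dominates, cost scales with the square of
    sequence length -- i.e. with fertility squared for the same text."""
    return base_musd * fertility_multiplier ** 2

print(scaled_training_cost(105, 2))  # doubled fertility -> 4x cost: 420.0
```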

The cost isn't just financial. Fertility also degrades quality. The same research found that each additional token per word reduces model accuracy by 8 to 18 percentage points, depending on the subject and model. Fertility alone explains 20-50% of variance in accuracy across their analyses. Languages that get inefficiently tokenized aren't just more expensive to use; the models are measurably worse at them.

Our read: This is one of AI's clearest fairness problems, hiding in plain sight. Non-English speakers pay more for worse service, and the cause is a data structure decision made years ago when someone chose what corpus to train the tokenizer on. Meta's Llama 3 tokenizer was trained on data that's 95% English and code. The outcome we're seeing is exactly what you'd expect.

Vocabularies Are Getting Bigger, Fast

Progress is real, and it's accelerating. Vocabulary sizes climbed steadily (from 32K to 50K to 100K to 128K) and then made a sharp jump: GPT-4o's o200k_base tokenizer doubled the vocabulary to roughly 200,000 tokens, while Google's Gemma pushed to 256,000. OpenAI's o200k_harmony tokenizer, released alongside the open-weight gpt-oss models, extends to 201,088 tokens with added support for structured conversational inputs.

The trend line isn't linear anymore. It's steepening.

The bigger story is what those expanded vocabularies actually fix. GPT-4o's tokenizer didn't just get larger; it was explicitly redesigned for multilingual efficiency. For Indian languages like Malayalam, token usage dropped roughly 4x compared to GPT-4's cl100k_base. Kannada and Telugu saw similar gains. The old tokenizer was splitting a simple Tamil letter-diacritic combination like நீ ("you") into four separate tokens; the new one handles it as a single unit. That's not incremental. For anyone building products in those markets, it's a cost structure that went from prohibitive to workable.
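You can see why that Tamil combination is fragile using nothing but stdlib Unicode handling. This shows the code-point and byte structure, not either tokenizer's actual merge table:

```python
# நீ ("you") reads as one visual unit but is two Unicode code points -- the
# consonant ந (U+0BA8) plus the vowel sign ீ (U+0BC0) -- and six UTF-8 bytes.
# A byte-level tokenizer with no Tamil merges has five internal positions
# where it can (and will) split.
s = "\u0ba8\u0bc0"  # ந + vowel sign

print(len(s))                  # 2 code points
print(len(s.encode("utf-8")))  # 6 bytes
```

A tokenizer trained on enough Tamil learns a merge covering all six bytes, which is exactly what o200k_base did and cl100k_base didn't.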

Some researchers are pushing further, investigating multilingually fair tokenizers trained on balanced corpora. Reasoning models like DeepSeek-R1 and OpenAI's o1 appear to narrow the accuracy gap somewhat, improving African-language performance by 8-12 points, which suggests better architectures can partially compensate for tokenizer bias.

Then there's the most radical idea: skip tokenization entirely. Meta's Byte Latent Transformer (BLT), presented at ACL 2025, processes raw bytes instead of tokens. It dynamically groups bytes into variable-sized "patches" based on information density, allocating more compute to complex sequences and less to predictable ones. BLT matches Llama 3 performance while using up to 50% fewer FLOPs at inference, and it outperforms token-based models on tasks like character manipulation and noisy text by wide margins.
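To make "variable-sized patches based on information density" concrete, here's a deliberately crude stand-in: cut a patch wherever a running frequency model finds the next byte surprising. BLT uses a learned entropy model, not this heuristic, and the threshold and Laplace smoothing here are arbitrary choices for illustration:

```python
from collections import Counter
import math

def patch_by_surprise(data: bytes, threshold: float = 7.5) -> list[bytes]:
    """Group bytes into patches, starting a new patch when the next byte
    is surprising (high -log2 probability) under a Laplace-smoothed
    running unigram model. Predictable runs get long patches."""
    counts, total = Counter(), 0
    patches, current = [], bytearray()
    for b in data:
        surprise = -math.log2((counts[b] + 1) / (total + 256))
        if current and surprise > threshold:
            patches.append(bytes(current))
            current = bytearray()
        current.append(b)
        counts[b] += 1
        total += 1
    if current:
        patches.append(bytes(current))
    return patches

print(patch_by_surprise(b"a" * 11 + b"b"))
# the predictable run of a's stays one patch; the novel byte opens a new one
```

The key property, shared with BLT's real mechanism: patch boundaries respond to the data itself, so compute concentrates where the byte stream is hard to predict.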

Because it operates on bytes, every language gets treated identically at the input level. No corpus bias, no fertility penalty, no multilingual tax baked into the architecture.

BLT is still a research project, not a production system. But it's the first byte-level architecture to match BPE-based models at scale, and it points toward a future where the entire tokenization problem dissolves rather than getting patched vocabulary by vocabulary.

For developers building products that serve non-English users, the near-term picture is genuinely better than it was a year ago. Larger vocabularies and smarter tokenizer design are closing the gap. But as long as pricing is per-token and most tokenizers are trained on English-dominant data, the multilingual tax persists for many languages. As we previously covered, inference cost is fundamentally about how efficiently you move data through memory; tokenization determines how much data there is to move in the first place.

Frequently Asked Questions