Byte Pair Encoding
A compression algorithm that iteratively merges frequent character pairs to build a vocabulary, used by most major LLMs to convert text into tokens.
Byte Pair Encoding (BPE) starts from individual characters and repeatedly merges the most frequent adjacent pair of symbols until a target vocabulary size is reached. Originally developed by Philip Gage in 1994 as a data-compression technique, it was adapted for subword tokenization in neural machine translation and later adopted by OpenAI for the GPT models. Modern variants such as byte-level BPE operate on raw bytes rather than Unicode characters, so any input can be represented without out-of-vocabulary failures. The training corpus determines which text patterns become single tokens and which are split into multi-token fragments.
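The training loop described above can be sketched in a few lines of Python. This is a minimal toy illustration, not a production tokenizer: the function name `train_bpe`, the tuple-of-symbols word representation, and the tiny corpus are all choices made for this example, and real implementations add byte-level fallback, pre-tokenization, and tie-breaking rules.

```python
from collections import Counter

def train_bpe(corpus, num_merges):
    """Learn BPE merge rules from a list of words (toy sketch)."""
    # Represent each word as a tuple of symbols, starting from characters.
    word_freqs = Counter(tuple(word) for word in corpus)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in word_freqs.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        # Pick the most frequent pair and record it as a merge rule.
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Apply the merge everywhere the pair occurs.
        new_freqs = Counter()
        for symbols, freq in word_freqs.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_freqs[tuple(out)] += freq
        word_freqs = new_freqs
    return merges

merges = train_bpe(["low", "low", "lower", "lowest"], num_merges=3)
# First merge fuses "l"+"o"; later merges build longer units from earlier ones.
```

Note how merges compound: once `("l", "o")` becomes the symbol `"lo"`, subsequent rounds can merge it with `"w"` to form `"low"`, which is how frequent whole words end up as single tokens.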
Also known as
BPE, byte pair encoding, byte-level BPE