Byte Pair Encoding
A compression algorithm that iteratively merges the most frequent adjacent symbol pairs in a corpus to build a subword vocabulary.
Byte Pair Encoding (BPE) starts with a base vocabulary of individual characters (or bytes, in byte-level BPE) and repeatedly merges the most frequent adjacent pair in a training corpus until reaching a target vocabulary size. Originally a data compression technique (Gage, 1994), BPE was adapted for subword tokenization in neural machine translation (Sennrich et al., 2016); OpenAI's GPT models popularized the byte-level variant, and BPE is now used by nearly every major LLM. The merge rules learned during training determine how efficiently different texts are represented as tokens.
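To make the training loop concrete, here is a minimal sketch in Python. It is not any particular library's implementation; the function names and the toy corpus are illustrative. It counts adjacent pairs over a word-frequency table, merges the most frequent pair everywhere it occurs, and repeats for a fixed number of merges:

```python
from collections import Counter

def get_pair_counts(tokenized):
    # Count adjacent symbol pairs across all words, weighted by word frequency.
    pairs = Counter()
    for symbols, freq in tokenized:
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(tokenized, pair):
    # Replace every occurrence of `pair` with its concatenation.
    merged = pair[0] + pair[1]
    out = []
    for symbols, freq in tokenized:
        new_symbols = []
        i = 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                new_symbols.append(merged)
                i += 2
            else:
                new_symbols.append(symbols[i])
                i += 1
        out.append((new_symbols, freq))
    return out

def train_bpe(corpus_words, num_merges):
    # Start from individual characters; learn `num_merges` merge rules.
    word_freqs = Counter(corpus_words)
    tokenized = [(list(word), freq) for word, freq in word_freqs.items()]
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(tokenized)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        tokenized = merge_pair(tokenized, best)
    return merges

merges = train_bpe(["low", "lower", "lowest", "low"], num_merges=3)
print(merges)  # [('l', 'o'), ('lo', 'w'), ('low', 'e')]
```

Production tokenizers add details this sketch omits, such as end-of-word markers, a byte-level base vocabulary, and deterministic tie-breaking when several pairs share the top frequency.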
Also known as
BPE, byte-level BPE