Tokenization

Tokenization converts raw text into discrete tokens that can be processed by a model.

Byte Pair Encoding (BPE)

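BPE starts from individual characters (or bytes) and builds a vocabulary by repeatedly merging the most frequent adjacent pair of symbols into a new symbol, until a target vocabulary size is reached.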
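The merge loop can be sketched in a few lines of Python. This is a minimal illustration on a toy corpus, not the implementation used by any particular model; the function names (get_pair_counts, merge_pair, train_bpe) and the whitespace pre-splitting are assumptions made for the example.

```python
from collections import Counter

def get_pair_counts(words):
    """Count adjacent symbol pairs, weighted by how often each word occurs."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, words):
    """Rewrite every word, replacing each occurrence of `pair` by one merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def train_bpe(corpus, num_merges):
    """Learn up to `num_merges` merges from a whitespace-split toy corpus."""
    # Each word starts as a tuple of single characters.
    words = {tuple(w): c for w, c in Counter(corpus.split()).items()}
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(words)
        if not pairs:
            break
        # Most frequent pair; ties go to the pair seen first.
        best = max(pairs, key=pairs.get)
        merges.append(best)
        words = merge_pair(best, words)
    return merges, words

merges, words = train_bpe("low low low lower lowest", num_merges=2)
print(merges)  # [('l', 'o'), ('lo', 'w')]
```

After two merges the frequent word "low" is a single symbol, while "lower" and "lowest" are still split into "low" plus leftover characters.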

GPT-2 and GPT-4 use byte-level BPE. Frequent words end up as single tokens, while rare words are split into several subword tokens.

WordPiece, used by BERT, is similar to BPE, but it picks the merge that most increases the likelihood of the training data rather than the most frequent pair. Each candidate pair is scored as:

score(A, B) = freq(AB) / (freq(A) · freq(B))
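To see how this score differs from a raw frequency count, the sketch below evaluates it on a small hand-made word list. The word counts and the function name wordpiece_scores are made up for illustration, and the "##" continuation prefix that BERT's vocabulary uses is left out to keep the example short.

```python
from collections import Counter

def wordpiece_scores(words):
    """Return freq(AB) / (freq(A) * freq(B)) for every adjacent pair (A, B)."""
    unit = Counter()  # freq(A): occurrences of each single symbol
    pair = Counter()  # freq(AB): occurrences of each adjacent pair
    for symbols, freq in words.items():
        for s in symbols:
            unit[s] += freq
        for a, b in zip(symbols, symbols[1:]):
            pair[(a, b)] += freq
    return {p: f / (unit[p[0]] * unit[p[1]]) for p, f in pair.items()}

# Toy corpus: word (as a tuple of characters) -> count. Values are illustrative.
words = {
    ("h", "u", "g"): 10,
    ("p", "u", "g"): 5,
    ("p", "u", "n"): 12,
    ("b", "u", "n"): 4,
    ("h", "u", "g", "s"): 5,
}

scores = wordpiece_scores(words)
best = max(scores, key=scores.get)
print(best, scores[best])  # ('g', 's') 0.05
```

The most frequent pair here is ("u", "g") with 20 occurrences, which plain BPE would merge first. Its score is only 1/36 because "u" is common everywhere, so the scored variant prefers ("g", "s"), whose parts rarely occur apart from each other.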

SentencePiece is a language-independent tokenizer that treats the input as a raw stream of Unicode characters, whitespace included, so it needs no language-specific pre-tokenization. Used by LLaMA and T5.
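The idea of treating whitespace as an ordinary symbol can be illustrated without the library itself. SentencePiece marks spaces with the meta symbol "▁" (U+2581); the helpers below (to_pieces, from_pieces) are hypothetical names for this sketch, not part of the SentencePiece API, and they only split into characters rather than applying a learned vocabulary.

```python
SPACE = "\u2581"  # "▁", the meta symbol SentencePiece uses for whitespace

def to_pieces(text):
    """Escape spaces, then split into single characters (no learned merges here)."""
    return list(text.replace(" ", SPACE))

def from_pieces(pieces):
    """Concatenate pieces and restore spaces, recovering the original text."""
    return "".join(pieces).replace(SPACE, " ")

pieces = to_pieces("Hello world")
print(pieces)
print(from_pieces(pieces))  # Hello world
```

Because the space is kept as a symbol, decoding is a plain concatenation and the original text is recovered exactly. The same code handles text with no spaces at all, such as Japanese or Chinese, without any word segmentation step.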

Vocabulary sizes: GPT-4 has about 100,000 tokens; LLaMA 2 has 32,000.