Tokenization
How text is split into tokens for LLM input: BPE, WordPiece, SentencePiece, and the vocabulary trade-offs.
Tokenization converts raw text into a sequence of discrete tokens, each mapped to an integer ID, that a model can process.
Byte Pair Encoding (BPE)
BPE builds a vocabulary by iteratively merging the most frequent pair of symbols:
- Start with a character-level (or byte-level) vocabulary
- Count all adjacent symbol pairs in the training corpus
- Merge the most frequent pair into a new token
- Repeat until the target vocabulary size is reached
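The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used by any particular model: the function name `train_bpe`, the word-frequency input format, and the toy corpus are assumptions made for the example, and production tokenizers typically work on bytes and add pre-tokenization rules.

```python
from collections import Counter

def train_bpe(word_freqs, num_merges):
    """Learn BPE merges from a dict mapping word -> frequency (toy sketch)."""
    # Start from a character-level vocabulary: each word is a tuple of characters.
    corpus = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count all adjacent symbol pairs, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single token
        # Pick the most frequent pair (ties go to the pair seen first).
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Replace every occurrence of the best pair with the merged symbol.
        new_corpus = {}
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            key = tuple(merged)
            new_corpus[key] = new_corpus.get(key, 0) + freq
        corpus = new_corpus
    return merges, corpus

# Hypothetical toy corpus: word -> count.
toy = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}
merges, segmented = train_bpe(toy, num_merges=3)
print(merges)  # [('u', 'g'), ('u', 'n'), ('h', 'ug')]
```

Here the number of merges stands in for the target vocabulary size: each merge adds exactly one new token to the initial character set.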
GPT-2 and GPT-4 use byte-level BPE. Common words become single tokens; rare words are split into several subword tokens.
WordPiece
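Used by BERT. Similar to BPE, but instead of merging the most frequent pair it merges the pair that most increases the likelihood of the training data. In practice, each candidate pair is scored by its frequency normalized by the frequencies of its parts: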
score(A, B) = freq(AB) / (freq(A) · freq(B))
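To see how this score differs from plain frequency counting, the sketch below computes it for every adjacent pair in a toy corpus. The function name `wordpiece_scores` and the corpus are assumptions made for illustration, and the `##` continuation prefix that BERT's WordPiece attaches to non-initial pieces is omitted for brevity.

```python
from collections import Counter

def wordpiece_scores(corpus):
    """Score each adjacent pair as freq(AB) / (freq(A) * freq(B)).

    corpus maps a tuple of symbols (one word) to that word's frequency.
    """
    symbol_freq = Counter()
    pair_freq = Counter()
    for symbols, freq in corpus.items():
        for s in symbols:
            symbol_freq[s] += freq
        for pair in zip(symbols, symbols[1:]):
            pair_freq[pair] += freq
    return {
        (a, b): f / (symbol_freq[a] * symbol_freq[b])
        for (a, b), f in pair_freq.items()
    }

# Hypothetical toy corpus, split into characters.
toy = {tuple(w): f for w, f in
       {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}.items()}
scores = wordpiece_scores(toy)
best = max(scores, key=scores.get)
print(best, scores[best])  # ('g', 's') 0.05
```

On this corpus the most frequent pair is `('u', 'g')` (20 occurrences), which frequency-based BPE would merge first. The normalized score instead favors `('g', 's')`, because `s` almost never appears without a preceding `g`.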
SentencePiece
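A language-independent tokenizer library that treats the input as a raw stream of Unicode characters, whitespace included, so it needs no language-specific pre-tokenization. It is a framework rather than a single algorithm: it can train either BPE or unigram language model vocabularies. Used by LLaMA (versions 1 and 2) and T5.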
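The whitespace handling is what makes this approach reversible: spaces are replaced with the visible meta symbol "▁" (U+2581) before segmentation, so the original text can be recovered by joining the pieces. The sketch below imitates only that convention in plain Python; it does not call the SentencePiece library, and the helper names and the example segmentation are assumptions.

```python
WS = "\u2581"  # "▁", the meta symbol SentencePiece uses to mark whitespace

def escape_whitespace(text):
    # Prepend a marker and replace every space, so whitespace becomes
    # an ordinary symbol that can be merged into tokens.
    return WS + text.replace(" ", WS)

def detokenize(pieces):
    # Joining the pieces and mapping the marker back to a space restores
    # the text; the leading space comes from the prepended marker.
    text = "".join(pieces).replace(WS, " ")
    return text[1:] if text.startswith(" ") else text

print(escape_whitespace("Hello world."))  # ▁Hello▁world.

# A hypothetical segmentation a trained model might produce:
pieces = [WS + "Hello", WS + "wor", "ld", "."]
print(detokenize(pieces))  # Hello world.
```

Because the marker is part of the tokens themselves, no separate rule is needed to decide where spaces go during decoding.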
Vocabulary Size Trade-offs
| Small vocab | Large vocab |
|---|---|
| More tokens per sentence | Fewer tokens per sentence |
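| Rare and out-of-vocabulary (OOV) words are split into many small pieces | More words get their own token, but rarely seen tokens receive little training signal |
| Slower inference (longer sequences, more decoding steps) | Faster inference (shorter sequences), at the cost of larger embedding and output layers |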
For reference, GPT-4's vocabulary has roughly 100,000 tokens, while LLaMA 2's has 32,000.
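The effect of vocabulary size on sequence length can be seen with a toy tokenizer. The sketch below uses greedy longest-match segmentation with a single-character fallback, which is a simplification chosen for brevity and is not how BPE applies its merges; the function name and both vocabularies are hypothetical.

```python
def greedy_tokenize(text, vocab):
    """Split text by always taking the longest vocabulary entry that matches.

    Falls back to a single character when nothing in vocab matches.
    """
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until one matches.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab or j == i + 1:
                tokens.append(text[i:j])
                i = j
                break
    return tokens

text = "tokenization is fun"
small_vocab = set()  # characters only, via the fallback
large_vocab = {"token", "ization", " is", " fun"}

print(len(greedy_tokenize(text, small_vocab)))  # 19
print(greedy_tokenize(text, large_vocab))       # ['token', 'ization', ' is', ' fun']
```

The same sentence costs 19 tokens with the character-only vocabulary and 4 with the larger one, which is the sequence-length side of the trade-off in the table.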
Tokenizer Quirks
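- Numbers are often split into single digits or irregular multi-digit chunks, depending on the tokenizer
- Spaces are part of tokens in most modern tokenizers, so " the" (with a leading space) and "the" are different tokens
- Capitalization creates different tokens: "Hello" and "hello" map to different token IDs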