Tokenization

How text is split into tokens for LLM input: BPE, WordPiece, SentencePiece, and the vocabulary trade-offs.

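Tokenization converts raw text into a sequence of discrete tokens, each mapped to an integer ID from a fixed vocabulary. Those IDs, not the characters themselves, are what the model processes.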

Byte Pair Encoding (BPE)

BPE builds a vocabulary by iteratively merging the most frequent pair of symbols:

  1. Start with a character-level (or byte-level) vocabulary
  2. Count all adjacent symbol pairs in the training corpus
  3. Merge the most frequent pair into a new token
  4. Repeat until the target vocabulary size is reached
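
The four steps above can be sketched in a few lines of Python. This is a minimal illustration rather than a production tokenizer: the function names `train_bpe` and `merge_pair` and the toy corpus are assumptions made for this example, the stopping condition is a number of merges instead of a vocabulary size, and ties between equally frequent pairs are broken by whichever pair was seen first.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def train_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words."""
    # Step 1: start from characters; each distinct word is a tuple of symbols with a count.
    corpus = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Step 2: count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, count in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += count
        if not pairs:
            break  # every word is already a single symbol
        # Step 3: merge the most frequent pair (first-seen pair wins ties).
        best = max(pairs, key=pairs.get)
        merges.append(best)
        new_corpus = Counter()
        for symbols, count in corpus.items():
            new_corpus[merge_pair(symbols, best)] += count
        corpus = new_corpus  # Step 4: repeat on the updated corpus
    return merges


# Toy corpus (hypothetical): word frequencies chosen so that "est" emerges early.
toy_words = ["low"] * 5 + ["lower"] * 2 + ["newest"] * 6 + ["widest"] * 3
print(train_bpe(toy_words, 4))
```

On this toy corpus the suffix "est" is assembled from two merges before any whole word is, which is the mechanism by which frequent character sequences end up as single tokens.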

GPT-2 and GPT-4 both use byte-level BPE. Common words become single tokens; rare words are split into several subword tokens.

WordPiece

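Used by BERT. Like BPE, it grows the vocabulary by merging pairs, but it picks the pair whose merge most increases the likelihood of the training data rather than the most frequent pair. In practice that means merging the pair with the highest score: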

score(A, B) = freq(AB) / (freq(A) · freq(B))
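
To see how this score differs from a raw frequency count, the sketch below computes it for a small toy corpus. The function name `wordpiece_scores` and the corpus are assumptions for illustration, and the "##" continuation prefix that BERT's vocabulary uses is left out to keep the example short.

```python
from collections import Counter


def wordpiece_scores(corpus):
    """Return score(A, B) = freq(AB) / (freq(A) * freq(B)) for every adjacent pair.

    `corpus` maps a tuple of symbols (one word) to how often that word occurs.
    """
    unit_freq = Counter()
    pair_freq = Counter()
    for symbols, count in corpus.items():
        for s in symbols:
            unit_freq[s] += count
        for a, b in zip(symbols, symbols[1:]):
            pair_freq[(a, b)] += count
    scores = {
        (a, b): f / (unit_freq[a] * unit_freq[b])
        for (a, b), f in pair_freq.items()
    }
    return scores, pair_freq


# Toy corpus (hypothetical word counts).
toy = {
    ("h", "u", "g"): 10,
    ("p", "u", "g"): 5,
    ("p", "u", "n"): 12,
    ("b", "u", "n"): 4,
    ("h", "u", "g", "s"): 5,
}
scores, pair_freq = wordpiece_scores(toy)
print("most frequent pair:", max(pair_freq, key=pair_freq.get))
print("highest-scoring pair:", max(scores, key=scores.get))
```

Here the most frequent pair is ("u", "g"), which BPE would merge first, but the highest-scoring pair is ("g", "s"): "s" is rare on its own, so the pair is frequent relative to its parts. The division by the individual frequencies is what penalizes merges between symbols that are common everywhere.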

SentencePiece

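Language-independent tokenizer that treats the input as a raw stream of Unicode characters, whitespace included, so it needs no language-specific pre-tokenization step. It implements both BPE and unigram language model segmentation. Used by LLaMA, T5.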
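
The idea of treating whitespace as just another symbol can be illustrated without the library itself. SentencePiece escapes spaces to the meta symbol "▁" (U+2581) so that segmentation is reversible; the sketch below imitates only that step, using single characters as a stand-in for learned subword pieces. The function names are assumptions for this example, and the round trip would break for input that already contains "▁".

```python
WS = "\u2581"  # "▁", the meta symbol SentencePiece uses to mark whitespace


def to_pieces(text):
    """Escape spaces, then split into symbols with no word-level pre-tokenization."""
    return list(text.replace(" ", WS))


def from_pieces(pieces):
    """Concatenate pieces and restore spaces; no language-specific rules are needed."""
    return "".join(pieces).replace(WS, " ")


for sample in ["Hello world", "東京 に 行く", "no_spaces_here"]:
    pieces = to_pieces(sample)
    # Decoding is plain concatenation, so the original text is recovered exactly.
    assert from_pieces(pieces) == sample
    print(pieces)
```

Because the space survives as a symbol, the same procedure applies to languages that separate words with spaces and to those that do not.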

Vocabulary Size Trade-offs

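Small vocab                                | Large vocab
More tokens per sentence                   | Fewer tokens per sentence
Rare words built from many small pieces    | Rare tokens appear seldom in training and may be poorly learned
Slower inference (longer sequences)        | Faster inference (shorter sequences), but a larger embedding table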

Vocabulary sizes in practice: GPT-4 (cl100k_base encoding) has roughly 100,000 tokens; LLaMA 2 has 32,000.
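
One cost of a larger vocabulary that is easy to put a number on is the input embedding table, which holds one vector per token. The sketch below works through the arithmetic for the two vocabulary sizes above. The hidden size of 4096 is an assumed value chosen only for illustration, and `embedding_params` is a hypothetical helper, not part of any library.

```python
def embedding_params(vocab_size, d_model):
    """Number of weights in an embedding table with one d_model-sized vector per token."""
    return vocab_size * d_model


D_MODEL = 4096  # assumed hidden size, for illustration only

for name, vocab_size in [("32,000-token vocab", 32_000), ("100,000-token vocab", 100_000)]:
    params = embedding_params(vocab_size, D_MODEL)
    print(f"{name}: {params:,} embedding weights")
```

Under that assumption the smaller vocabulary needs 131,072,000 embedding weights and the larger one 409,600,000, so the shorter sequences of a large vocabulary are paid for with extra parameters.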

Tokenizer Quirks

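  • Numbers are often split into single digits or short digit groups, depending on the tokenizer, so the same number can be segmented in unintuitive ways
  • Spaces are part of tokens in most modern tokenizers: a leading space is usually attached to the following word, so " hello" and "hello" are different tokens
  • Capitalization creates different tokens: "Hello" and "hello" map to different IDs in case-sensitive vocabularies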
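
These quirks can be reproduced with a toy tokenizer. The sketch below uses greedy longest-match segmentation over a tiny hand-written vocabulary. Both the vocabulary and the `tokenize` function are hypothetical, and longest-match is a simplification of how real BPE tokenizers apply their merge rules.

```python
# Hypothetical toy vocabulary; real vocabularies hold tens of thousands of entries.
VOCAB = {
    " hello": 0, "hello": 1, "Hello": 2, " world": 3, " ": 4,
    "1": 5, "2": 6, "3": 7, "4": 8,
}


def tokenize(text, vocab=VOCAB):
    """Split `text` by repeatedly taking the longest vocabulary entry that matches."""
    max_len = max(len(token) for token in vocab)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink until something matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if piece in vocab:
                tokens.append(piece)
                i += size
                break
        else:
            raise ValueError(f"no token covers {text[i]!r}")
    return tokens


for sample in ["1234", "hello world", " hello", "Hello"]:
    pieces = tokenize(sample)
    print(repr(sample), "->", pieces, [VOCAB[p] for p in pieces])
```

With this vocabulary "1234" falls apart into four digit tokens, the space in "hello world" travels with "world", and "hello", " hello", and "Hello" each get a different ID.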