Transformer Architecture
The "Attention Is All You Need" paper revolutionized NLP. This article covers the encoder-decoder structure, multi-head attention, and positional encodings.
The Transformer, introduced by Vaswani et al. in 2017, replaced recurrence with attention as the mechanism for relating tokens in a sequence.
The Core Idea
Recurrent seq2seq models process tokens one at a time, because each hidden state depends on the previous one. Transformers process all tokens of a sequence in parallel during training, which makes much better use of modern parallel hardware such as GPUs.
Multi-Head Self-Attention
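Each attention head can learn to focus on a different aspect of the input. A single head computes scaled dot-product attention over queries Q, keys K, and values V, where d_k is the key dimension and the √d_k scaling keeps large dot products from saturating the softmax: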
Attention(Q, K, V) = softmax(QKᵀ / √d_k) · V
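The formula above can be sketched in a few lines of Python. This is a minimal illustration, assuming plain nested lists in place of a tensor library; the function names softmax and attention are chosen here for illustration and do not come from the paper.

```python
import math

def softmax(row):
    # Subtract the max before exponentiating for numerical stability.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention on nested lists.

    Q: n_q x d_k, K: n_k x d_k, V: n_k x d_v.
    Returns an n_q x d_v list of output rows.
    """
    scale = math.sqrt(len(K[0]))  # sqrt(d_k)
    d_v = len(V[0])
    out = []
    for q in Q:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in K]
        weights = softmax(scores)
        # Output is the attention-weighted average of the value rows.
        out.append([sum(w * v[j] for w, v in zip(weights, V)) for j in range(d_v)])
    return out
```

Each output row is a convex combination of the rows of V, so when all keys score equally the result is simply the mean of the values.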
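Each head applies attention to its own learned linear projections of Q, K, and V. The head outputs are then concatenated and projected back to d_model dimensions: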
MultiHead(Q, K, V) = Concat(head₁, ..., headₕ) · W_O
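A rough sketch of the concatenate-and-project step follows, again with nested lists. It assumes the self-attention case, where Q, K, and V are all derived from the same input X. The layout of heads as a list of (W_Q, W_K, W_V) tuples is a hypothetical convention chosen for this example.

```python
import math

def matmul(A, B):
    # Nested-list matrix product: (n x m) @ (m x p) -> (n x p).
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    # softmax(Q K^T / sqrt(d_k)) V
    scale = math.sqrt(len(K[0]))
    K_T = [list(col) for col in zip(*K)]
    weights = [softmax([s / scale for s in row]) for row in matmul(Q, K_T)]
    return matmul(weights, V)

def multi_head(X, heads, W_O):
    """Multi-head self-attention over X (n x d_model).

    heads: list of (W_Q, W_K, W_V) projection matrices, one tuple per head
           (hypothetical layout for this sketch).
    W_O:   (h * d_v) x d_model output projection.
    """
    outputs = [attention(matmul(X, W_Q), matmul(X, W_K), matmul(X, W_V))
               for W_Q, W_K, W_V in heads]
    # Concatenate the head outputs along the feature axis, row by row.
    concat = [sum((head[i] for head in outputs), []) for i in range(len(X))]
    return matmul(concat, W_O)
```

Practical implementations usually fuse the per-head projections into single large matrix multiplications, but the loop over heads keeps the correspondence with the formula visible.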
Positional Encoding
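Self-attention has no built-in notion of token order, so positional information must be injected explicitly. The original paper adds fixed sinusoidal encodings to the input embeddings, where pos is the token position and i indexes pairs of embedding dimensions: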
PE(pos, 2i) = sin(pos / 10000^(2i / d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
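The two formulas above fill even dimensions with sines and odd dimensions with cosines. A minimal sketch, assuming d_model is even and using a hypothetical function name positional_encoding:

```python
import math

def positional_encoding(max_len, d_model):
    """Return a max_len x d_model table of sinusoidal encodings.

    Assumes d_model is even, so every sine has a paired cosine.
    """
    pe = [[0.0] * d_model for _ in range(max_len)]
    for pos in range(max_len):
        for i in range(d_model // 2):
            # Wavelengths grow geometrically with the dimension index i.
            angle = pos / (10000 ** (2 * i / d_model))
            pe[pos][2 * i] = math.sin(angle)
            pe[pos][2 * i + 1] = math.cos(angle)
    return pe
```

Row pos of the table is added element-wise to the embedding of the token at that position. At position 0, every sine entry is 0 and every cosine entry is 1.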
Feed-Forward Networks
After the attention sublayer in each layer, every position is passed independently through the same two-layer feed-forward network with a ReLU activation:
FFN(x) = max(0, xW₁ + b₁) · W₂ + b₂
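As a small illustration of "same network, every position", the sketch below applies the formula to one position vector and then maps it over a sequence. The names ffn and ffn_all_positions are illustrative, and the weights are plain nested lists.

```python
def ffn(x, W1, b1, W2, b2):
    """FFN(x) = max(0, x W1 + b1) W2 + b2 for a single position vector x.

    W1: d_model x d_ff, b1: d_ff, W2: d_ff x d_model, b2: d_model.
    """
    # First linear layer followed by ReLU.
    hidden = [max(0.0, sum(xi * w for xi, w in zip(x, col)) + b)
              for col, b in zip(zip(*W1), b1)]
    # Second linear layer maps back to d_model dimensions.
    return [sum(h * w for h, w in zip(hidden, col)) + b
            for col, b in zip(zip(*W2), b2)]

def ffn_all_positions(X, W1, b1, W2, b2):
    # The same weights are reused at every position; positions do not interact.
    return [ffn(x, W1, b1, W2, b2) for x in X]
```

Because positions do not interact here, all mixing of information between tokens happens in the attention sublayers.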
References
- Vaswani et al. (2017). Attention Is All You Need. NeurIPS.
- Alammar, J. (2018). The Illustrated Transformer. Blog post.