Transformer Architecture

The Transformer, introduced by Vaswani et al. in 2017, replaced recurrent sequence models with an architecture built on attention, using no recurrence or convolution.

The Core Idea

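Recurrent seq2seq models process tokens one at a time, so each step must wait for the hidden state of the previous one. A Transformer processes all tokens of a sequence in parallel during training, which maps far better onto modern parallel hardware such as GPUs.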

Multi-Head Self-Attention

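Each attention head learns to focus on different aspects of the input. A head computes scaled dot-product attention over queries Q, keys K, and values V, where d_k is the dimension of the keys: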

Attention(Q, K, V) = softmax(QKᵀ / √d_k) · V
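
The formula above can be sketched in a few lines of plain Python. This is a minimal, dependency-free illustration rather than an efficient implementation: matrices are lists of rows, the helper names (softmax, matmul, transpose, attention) are chosen here for illustration, and the toy input values are arbitrary.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def attention(q, k, v):
    """softmax(Q K^T / sqrt(d_k)) V for Q: (n, d_k), K: (m, d_k), V: (m, d_v)."""
    d_k = len(k[0])
    scores = matmul(q, transpose(k))
    # Scale each score by sqrt(d_k), then normalize every query's row of scores.
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v), weights

# Toy self-attention input: 3 tokens, d_k = d_v = 2 (arbitrary values).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out, weights = attention(x, x, x)
```

Each row of weights is a probability distribution over the input positions, so every output row is a weighted average of the value vectors.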

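Each head applies attention to its own learned linear projections of Q, K, and V. The outputs of the h heads are concatenated and projected back to d_model by W_O: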

MultiHead(Q, K, V) = Concat(head₁, ..., headₕ) · W_O
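
As a rough sketch of the concatenate-and-project step, the following plain-Python example runs self-attention with two heads. The per-head projection matrices (W_q, W_k, W_v), the random initialization, and the sizes (4 tokens, d_model = 8, h = 2) are assumptions made for illustration; trained models learn these weights.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(q, k, v):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in matmul(q, k_t)]
    return matmul(weights, v)

def random_matrix(rows, cols, rng):
    """Stand-in for learned weights (illustration only)."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

def multi_head(x, heads, w_o):
    """heads: list of (W_q, W_k, W_v) per head; w_o: (h * d_head, d_model)."""
    outputs = [attention(matmul(x, w_q), matmul(x, w_k), matmul(x, w_v))
               for w_q, w_k, w_v in heads]
    # Concatenate the head outputs along the feature axis, position by position.
    concat = [sum((head[i] for head in outputs), []) for i in range(len(x))]
    return matmul(concat, w_o)

rng = random.Random(0)
n_tokens, d_model, h = 4, 8, 2
d_head = d_model // h  # each head works in a smaller subspace
x = random_matrix(n_tokens, d_model, rng)
heads = [tuple(random_matrix(d_model, d_head, rng) for _ in range(3)) for _ in range(h)]
w_o = random_matrix(h * d_head, d_model, rng)
y = multi_head(x, heads, w_o)  # one d_model-sized vector per token
```

Splitting d_model across heads keeps the total cost close to that of a single full-width head while letting each head attend to different positions.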

Positional Encoding

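Since self-attention has no inherent notion of token order, positional information must be injected explicitly. The original paper adds fixed sinusoidal encodings to the input embeddings, where pos is the token position and i indexes a pair of embedding dimensions: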

PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
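
A minimal sketch of the two formulas above, filling a table with one row per position. The function name positional_encoding, the max_len parameter, and the assumption that d_model is even are choices made for this example.

```python
import math

def positional_encoding(max_len, d_model):
    """Return a max_len x d_model table of sinusoidal encodings (d_model assumed even)."""
    pe = [[0.0] * d_model for _ in range(max_len)]
    for pos in range(max_len):
        for i in range(d_model // 2):
            # Both members of a dimension pair share the same frequency.
            angle = pos / (10000 ** (2 * i / d_model))
            pe[pos][2 * i] = math.sin(angle)      # even dimensions use sine
            pe[pos][2 * i + 1] = math.cos(angle)  # odd dimensions use cosine
    return pe

pe = positional_encoding(max_len=50, d_model=16)
```

Row pos of the table is added elementwise to the embedding of the token at that position. Low values of i oscillate quickly with position, while high values of i change slowly, so each position receives a distinct pattern.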

Feed-Forward Networks

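After each attention sublayer, the same feed-forward network (FFN) is applied to every position independently. Its weights are shared across positions within a layer but differ from layer to layer: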

FFN(x) = max(0, xW₁ + b₁) · W₂ + b₂
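
The position-wise behavior can be sketched as follows. The weights and sizes here (d_model = 2, hidden width 3) are arbitrary toy values picked for illustration, and the function name ffn is likewise an assumption of this example.

```python
def ffn(x, w1, b1, w2, b2):
    """Apply max(0, x W1 + b1) W2 + b2 to each row (position) of x independently."""
    out = []
    for row in x:
        # First linear layer followed by ReLU.
        hidden = [max(0.0, sum(a * w for a, w in zip(row, col)) + b)
                  for col, b in zip(zip(*w1), b1)]
        # Second linear layer maps back to d_model.
        out.append([sum(hv * w for hv, w in zip(hidden, col)) + b
                    for col, b in zip(zip(*w2), b2)])
    return out

# Toy parameters (arbitrary): W1 is 2 x 3, W2 is 3 x 2.
w1 = [[1.0, -1.0, 0.5], [0.0, 2.0, -0.5]]
b1 = [0.0, 0.0, 0.1]
w2 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b2 = [0.5, -0.5]
x = [[1.0, 2.0], [-1.0, 0.0]]
y = ffn(x, w1, b1, w2, b2)
```

Because each row is handled on its own, applying ffn to a single position gives the same result as that position's row in the batched output; only the attention sublayer mixes information between positions.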

References

Vaswani et al. (2017). Attention Is All You Need. NeurIPS.
Alammar, J. The Illustrated Transformer. Blog post.