Attention Mechanisms
A deep dive into attention mechanisms: scaled dot-product attention, cross-attention, and flash attention.
Attention allows models to selectively focus on relevant parts of the input when producing each output token.
Scaled Dot-Product Attention
The core operation in transformers:
Attention(Q, K, V) = softmax(QKᵀ / √d_k) · V
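Dividing by √d_k keeps the dot products from growing with the key dimension: when the components of q and k have roughly unit variance, q·k has variance of about d_k. Without the scaling, large scores push the softmax into saturated regions where its gradients become extremely small.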
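The formula above can be sketched in a few lines of plain Python. This is a minimal, unbatched illustration that uses nested lists so it runs without dependencies; the function names and the toy inputs are assumptions chosen for the example, not part of any library.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(Q, K, V):
    """Q is n x d_k, K is m x d_k, V is m x d_v, all as lists of rows."""
    d_k = len(K[0])
    scale = 1.0 / math.sqrt(d_k)
    d_v = len(V[0])
    output = []
    for q in Q:
        # Scaled similarity between this query and every key.
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in K]
        weights = softmax(scores)
        # Each output row is a weighted average of the value rows.
        output.append([sum(w * v[c] for w, v in zip(weights, V)) for c in range(d_v)])
    return output

# Toy inputs (hypothetical): two queries, two keys, two values.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
result = scaled_dot_product_attention(Q, K, V)
```

Because the softmax weights are positive and sum to 1, every output row lies between the rows of V. Here the first query matches the first key best, so the first output row sits closer to V[0].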
Cross-Attention
In encoder-decoder models, the decoder queries the encoder's output:
- Q comes from the decoder
- K and V come from the encoder
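A rough sketch of this split, assuming single-head attention with hypothetical projection matrices W_q, W_k, and W_v and random toy states. The only difference from self-attention is where each input comes from: the queries are projected from the decoder states, while the keys and values are projected from the encoder states.

```python
import math
import random

def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(M):
    return [list(col) for col in zip(*M)]

def softmax(row):
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(decoder_states, encoder_states, W_q, W_k, W_v):
    Q = matmul(decoder_states, W_q)   # queries come from the decoder
    K = matmul(encoder_states, W_k)   # keys come from the encoder
    V = matmul(encoder_states, W_v)   # values come from the encoder
    scale = 1.0 / math.sqrt(len(K[0]))
    scores = matmul(Q, transpose(K))  # decoder_len x encoder_len
    weights = [softmax([scale * s for s in row]) for row in scores]
    return matmul(weights, V), weights

# Toy sizes (assumptions): 5 encoder tokens, 3 decoder tokens, model width 4, head width 2.
rng = random.Random(0)

def rand_matrix(rows, cols):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

encoder_states = rand_matrix(5, 4)
decoder_states = rand_matrix(3, 4)
W_q, W_k, W_v = rand_matrix(4, 2), rand_matrix(4, 2), rand_matrix(4, 2)

output, weights = cross_attention(decoder_states, encoder_states, W_q, W_k, W_v)
```

The output has one row per decoder token, and each row of weights is a distribution over the encoder tokens, so the two sequences may have different lengths.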
Flash Attention
Flash Attention (Dao et al., 2022) is an IO-aware exact attention algorithm that:
- Tiles the attention computation to stay in fast SRAM
- Avoids materializing the full N×N attention matrix
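- Achieves a 2-4× speedup over a standard attention implementation, with memory that grows linearly rather than quadratically in the sequence length N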
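The idea that makes the tiling exact is an online softmax: keys and values are processed block by block while a running maximum, a running denominator, and an unnormalized output are kept per query. The sketch below illustrates only that recurrence in plain Python. It is not the GPU kernel from the paper, it tiles only over keys and values, and the names tiled_attention and block_size are assumptions made for this example.

```python
import math
import random

def naive_attention(Q, K, V):
    """Reference version: computes a full row of scores for every query."""
    scale = 1.0 / math.sqrt(len(K[0]))
    d_v = len(V[0])
    out = []
    for q in Q:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in K]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        out.append([sum(e * v[c] for e, v in zip(exps, V)) / total for c in range(d_v)])
    return out

def tiled_attention(Q, K, V, block_size=2):
    """Visits K and V in blocks, keeping only running statistics per query."""
    scale = 1.0 / math.sqrt(len(K[0]))
    d_v = len(V[0])
    out = []
    for q in Q:
        running_max = -math.inf
        running_sum = 0.0      # softmax denominator accumulated so far
        acc = [0.0] * d_v      # unnormalized weighted sum of values so far
        for start in range(0, len(K), block_size):
            K_blk = K[start:start + block_size]
            V_blk = V[start:start + block_size]
            scores = [scale * sum(a * b for a, b in zip(q, k)) for k in K_blk]
            new_max = max(running_max, max(scores))
            # Rescale earlier partial results to the new running maximum.
            correction = math.exp(running_max - new_max)
            exps = [math.exp(s - new_max) for s in scores]
            running_sum = running_sum * correction + sum(exps)
            acc = [a * correction + sum(e * v[c] for e, v in zip(exps, V_blk))
                   for c, a in enumerate(acc)]
            running_max = new_max
        out.append([a / running_sum for a in acc])
    return out

# Toy data (assumption): 4 queries, 7 keys and values, width 3.
rng = random.Random(0)
Q = [[rng.gauss(0.0, 1.0) for _ in range(3)] for _ in range(4)]
K = [[rng.gauss(0.0, 1.0) for _ in range(3)] for _ in range(7)]
V = [[rng.gauss(0.0, 1.0) for _ in range(3)] for _ in range(7)]

reference = naive_attention(Q, K, V)
tiled = tiled_attention(Q, K, V, block_size=2)
max_diff = max(abs(a - b) for r, t in zip(reference, tiled) for a, b in zip(r, t))
```

The two results agree up to floating-point rounding, which is the sense in which the tiled algorithm is exact rather than approximate. At no point does the tiled version hold more than block_size scores for a query.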
Causal Masking
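In autoregressive language models, position i may attend only to positions j ≤ i. Future tokens are masked by adding the following to the scaled scores before the softmax: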
mask[i][j] = -∞ if j > i else 0
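A small sketch of how this mask behaves once it is added to the scores and passed through a softmax. The helper names causal_mask and masked_softmax are assumptions for the example, and the scores are all zero so that the effect of the mask alone is visible.

```python
import math

def causal_mask(n):
    """mask[i][j] is -inf where j > i (a future position), else 0."""
    return [[-math.inf if j > i else 0.0 for j in range(n)] for i in range(n)]

def masked_softmax(scores, mask):
    weights = []
    for score_row, mask_row in zip(scores, mask):
        masked = [s + m for s, m in zip(score_row, mask_row)]
        peak = max(masked)  # always finite, since the diagonal is never masked
        exps = [math.exp(x - peak) for x in masked]  # exp(-inf) evaluates to 0.0
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# Uniform scores for a 3-token sequence.
scores = [[0.0] * 3 for _ in range(3)]
weights = masked_softmax(scores, causal_mask(3))
# weights[0] == [1.0, 0.0, 0.0]
# weights[1] == [0.5, 0.5, 0.0]
# weights[2] is approximately [1/3, 1/3, 1/3]
```

Masked positions receive exactly zero weight, and each row is renormalized over the positions that remain visible.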
References
- Dao et al. (2022). FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness