Attention Mechanisms

Attention allows models to selectively focus on relevant parts of the input when producing each output token.

Scaled Dot-Product Attention

The core operation in transformers:

Attention(Q, K, V) = softmax(QKᵀ / √d_k) · V

The scaling factor √d_k prevents vanishing gradients in the softmax.

In encoder-decoder models, the decoder queries the encoder's output:

Flash Attention (Dao et al., 2022) is an IO-aware exact attention algorithm that:

In autoregressive language models, future tokens must be masked:

mask[i][j] = -∞ if j > i else 0

Dao et al. (2022). FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness