Attention Mechanisms

An overview of attention mechanisms: scaled dot-product attention, cross-attention, Flash Attention, and causal masking.

Attention allows models to selectively focus on relevant parts of the input when producing each output token.

Scaled Dot-Product Attention

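The core operation in transformers, where Q, K, and V are the query, key, and value matrices and d_k is the dimension of each key vector: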

Attention(Q, K, V) = softmax(QKᵀ / √d_k) · V

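Dividing by √d_k keeps the magnitude of the dot products from growing with d_k. Without it, large scores push the softmax into a saturated region where its gradients become vanishingly small.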
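The formula above can be sketched in a few lines of NumPy. This is a minimal single-head illustration without batching; the function names and the random test shapes are assumptions chosen for the example, not part of any particular library.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability; the result is unchanged.
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v)
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (n_q, n_k) similarity scores
    weights = softmax(scores, axis=-1)   # each row is a distribution over keys
    return weights @ V, weights          # output: (n_q, d_v)

# Illustrative shapes only: 4 queries, 6 keys/values, d_k = 64, d_v = 32.
rng = np.random.default_rng(0)
Q = rng.standard_normal((4, 64))
K = rng.standard_normal((6, 64))
V = rng.standard_normal((6, 32))

out, weights = scaled_dot_product_attention(Q, K, V)
print(out.shape)      # (4, 32)
print(weights.shape)  # (4, 6)
```

Each output row is a weighted average of the rows of V, with weights given by how strongly that query matches each key.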

Cross-Attention

In encoder-decoder models, the decoder queries the encoder's output:

  • Q comes from the decoder
  • K and V come from the encoder
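As a rough sketch of where Q, K, and V come from in this setting, the NumPy example below projects decoder states into queries and encoder states into keys and values. The projection matrices `W_q`, `W_k`, `W_v`, the dimensions, and the function name are hypothetical and randomly initialised purely for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def cross_attention(decoder_states, encoder_states, W_q, W_k, W_v):
    Q = decoder_states @ W_q   # queries from the decoder: (n_dec, d_k)
    K = encoder_states @ W_k   # keys from the encoder:    (n_enc, d_k)
    V = encoder_states @ W_v   # values from the encoder:  (n_enc, d_v)
    scores = Q @ K.T / np.sqrt(Q.shape[-1])   # (n_dec, n_enc)
    weights = softmax(scores, axis=-1)
    return weights @ V, weights

# Assumed sizes: 5 source positions, 3 target positions.
rng = np.random.default_rng(0)
d_model, d_k, d_v = 16, 8, 8
encoder_states = rng.standard_normal((5, d_model))
decoder_states = rng.standard_normal((3, d_model))
W_q = rng.standard_normal((d_model, d_k))
W_k = rng.standard_normal((d_model, d_k))
W_v = rng.standard_normal((d_model, d_v))

out, weights = cross_attention(decoder_states, encoder_states, W_q, W_k, W_v)
print(out.shape)      # (3, 8)
print(weights.shape)  # (3, 5)
```

The weight matrix has one row per decoder position and one column per encoder position, so the source and target sequences may have different lengths.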

Flash Attention

Flash Attention (Dao et al., 2022) is an IO-aware exact attention algorithm that:

  1. Tiles the attention computation to stay in fast SRAM
  2. Avoids materializing the full N×N attention matrix
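  3. Achieves a reported 2-4× speedup over standard attention implementations, depending on hardware and sequence length, with memory that grows linearly rather than quadratically in the sequence length N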
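The tiling idea can be illustrated on the CPU with NumPy. The sketch below is not the FlashAttention kernel: it tiles only over keys and values, runs in plain Python, and says nothing about SRAM or speed. It only shows how a running maximum and running sum (an "online" softmax) allow exact attention to be computed block by block without ever forming the full N×N matrix. The function names and `block_size` are assumptions made for the example.

```python
import numpy as np

def naive_attention(Q, K, V):
    scores = Q @ K.T / np.sqrt(Q.shape[-1])   # materializes the full (N, N) matrix
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

def tiled_attention(Q, K, V, block_size=4):
    n_q, d_k = Q.shape
    out = np.zeros((n_q, V.shape[-1]))       # unnormalized output accumulator
    row_max = np.full((n_q, 1), -np.inf)     # running max score per query
    row_sum = np.zeros((n_q, 1))             # running softmax denominator
    for start in range(0, K.shape[0], block_size):
        K_blk = K[start:start + block_size]
        V_blk = V[start:start + block_size]
        scores = Q @ K_blk.T / np.sqrt(d_k)  # only an (n_q, block) slice
        new_max = np.maximum(row_max, scores.max(axis=-1, keepdims=True))
        correction = np.exp(row_max - new_max)  # rescales earlier accumulators
        p = np.exp(scores - new_max)
        row_sum = row_sum * correction + p.sum(axis=-1, keepdims=True)
        out = out * correction + p @ V_blk
        row_max = new_max
    return out / row_sum

# N = 10 is not a multiple of the block size, so the last block is ragged.
rng = np.random.default_rng(0)
Q = rng.standard_normal((10, 8))
K = rng.standard_normal((10, 8))
V = rng.standard_normal((10, 8))

tiled = tiled_attention(Q, K, V, block_size=4)
naive = naive_attention(Q, K, V)
print(np.allclose(tiled, naive))  # True
```

The two results agree up to floating-point error, which is what "exact" means here: the tiled version changes how the computation is scheduled rather than approximating it.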

Causal Masking

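In autoregressive language models, each position may attend only to itself and earlier positions. Future tokens are masked by adding the following to the attention scores before the softmax: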

mask[i][j] = -∞ if j > i else 0
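A minimal NumPy sketch of this mask applied to self-attention follows. It assumes the mask is added to the scaled scores before the softmax, so that masked positions receive exactly zero weight; the function names are chosen for illustration.

```python
import numpy as np

def causal_mask(n):
    i = np.arange(n)[:, None]   # query (row) index
    j = np.arange(n)[None, :]   # key (column) index
    # -inf where the key lies in the future (j > i), 0 elsewhere.
    return np.where(j > i, -np.inf, 0.0)

def causal_self_attention(Q, K, V):
    n = Q.shape[0]
    scores = Q @ K.T / np.sqrt(Q.shape[-1]) + causal_mask(n)
    # The diagonal is never masked, so every row has a finite maximum.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)    # exp(-inf) = 0 for masked positions
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = rng.standard_normal((5, 8))
K = rng.standard_normal((5, 8))
V = rng.standard_normal((5, 8))

out, weights = causal_self_attention(Q, K, V)
print(np.allclose(np.triu(weights, k=1), 0.0))  # True: no weight on future tokens
print(weights[0, 0])                            # 1.0: the first token sees only itself
```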

References

  • Dao et al. (2022). FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness