Backpropagation

Backpropagation is the algorithm used to compute gradients in neural networks efficiently.

Chain Rule

Given a composition of functions L = f(g(x)), the chain rule gives:

dL/dx = (dL/df) · (df/dg) · (dg/dx)

Backprop applies this recursively from the output layer back to the input.

During the forward pass, activations are computed and cached at each layer:

z = Wx + b
a = σ(z)

Gradients flow backward through the network:

δL/δW = δL/δa · δa/δz · δz/δW

Once gradients are computed, weights are updated:

θ = θ - α · ∇_θ L

where α is the learning rate.

Vanishing gradients: gradients shrink exponentially in deep nets. Solved by ReLU, residual connections, batch norm.
Exploding gradients: gradients grow unboundedly. Solved by gradient clipping.