Backpropagation: The Chain Rule That Trains AI

Neural networks are nested math functions with tunable knobs. Backpropagation traces every calculation backward to figure out which knobs to turn. It's calculus, not magic.

Research · machine learning · neural networks · deep learning · technical explainer

Neural networks aren't magic. They're nested mathematical functions with adjustable parameters, and backpropagation is the accounting system that figures out which adjustments make predictions better. Not mysterious "learning" in any philosophical sense. Calculus, applied systematically.

The core insight is almost disappointingly simple: when a network makes a wrong prediction, you can trace backward through every calculation to determine exactly how much each parameter contributed to the error. The chain rule from calculus lets you do this efficiently.

That's the entire algorithm.

Prediction as composition

A neural network is a series of simple operations composed together. Input flows through layers, each layer applies a transformation (multiply by weights, add bias, apply activation function), and the output of one layer becomes the input to the next. Think of it as a pipeline: data enters, gets transformed repeatedly, and a prediction comes out the other end.
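That pipeline can be sketched in a few lines of plain Python. The weights and inputs here are arbitrary illustrative values, not from any real model:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def layer(inputs, weights, biases):
    """One layer: weighted sum plus bias per neuron, then an activation."""
    return [sigmoid(sum(w * x for w, x in zip(ws, inputs)) + b)
            for ws, b in zip(weights, biases)]

# Data enters, gets transformed twice, and a prediction comes out.
hidden     = layer([1.0, 0.5], [[0.1, 0.2], [0.3, 0.4]], [0.0, 0.0])
prediction = layer(hidden,     [[0.5, 0.6]],             [0.0])
```

The output of the first `layer` call becomes the input to the second, which is all "composition" means here.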

The Stanford CS231n course frames this as a computational graph, where each node represents an operation and edges show how values flow from inputs to outputs. The forward pass is conceptually simple; you're just doing arithmetic. The interesting question is: when the prediction is wrong, which parameters should you adjust, and by how much?

This is where the chain rule comes in. Remember it from calculus? If you have a composed function like f(g(x)), the derivative of the whole thing with respect to x is f'(g(x)) × g'(x). You multiply the "outer" derivative by the "inner" derivative. Neural networks are just deeply composed functions. A modern language model might chain together billions of operations, but the chain rule still applies: to find how much a parameter deep in the network affects the final error, you multiply derivatives along the path from that parameter to the output.
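A quick numerical sanity check of that rule, using toy functions with nothing network-specific about them:

```python
# f(g(x)) with g(x) = 3x + 1 and f(u) = u^2.
g  = lambda x: 3 * x + 1
f  = lambda u: u ** 2
dg = lambda x: 3.0        # g'(x)
df = lambda u: 2 * u      # f'(u)

x = 0.5
chain = df(g(x)) * dg(x)  # chain rule: f'(g(x)) * g'(x) = 2(2.5) * 3 = 15

# Compare against a finite-difference approximation of the derivative.
h = 1e-6
numeric = (f(g(x + h)) - f(g(x - h))) / (2 * h)
```

The analytic product and the numerical estimate agree, which is exactly the check (a "gradient check") practitioners run on backpropagation implementations.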

Stanford's CS231n notes describe this elegantly: each "gate" (operation) in the network receives a gradient from upstream, multiplies it by its local gradient, and passes the result downstream. Add gates distribute gradients equally. Multiply gates swap and scale. Max gates route gradients only to the winning input.

"The derivative on each variable tells you the sensitivity of the whole expression on its value."

A large gradient means small changes to that parameter cause large changes to the error. A tiny gradient means the parameter barely matters.

Working backward

The algorithm works backward through the network. You compute the error at the output, then propagate gradient signals backward through each layer, accumulating the chain of derivatives as you go.

Jürgen Schmidhuber's historical account corrects a common misconception: Hinton didn't invent backpropagation. Seppo Linnainmaa published the algorithm (technically called "reverse-mode automatic differentiation") in his 1970 master's thesis. Paul Werbos first applied it to neural networks in 1982. Rumelhart, Hinton, and Williams made it famous with their 1986 Nature paper, demonstrating that it produces useful internal representations. But the math was already there.

The efficiency insight that makes this practical: the backward pass costs about as much as the forward pass. Computing gradients for every parameter takes only a small constant factor more work than making the prediction itself, which is why training neural networks at scale is feasible at all.

Concrete numbers make this clearer. Matt Mazur's worked example walks through a tiny 2-2-2 network with specific values: inputs of 0.05 and 0.10, initial weights like w1=0.15, w2=0.20. The forward pass calculation is straightforward: net_h1 = 0.15 × 0.05 + 0.2 × 0.1 + 0.35 = 0.3775. Apply activation, combine with other neurons, get a prediction.
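That arithmetic, written out with Mazur's numbers:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Mazur's worked example: inputs i1, i2; weights w1, w2; hidden bias b1.
i1, i2 = 0.05, 0.10
w1, w2, b1 = 0.15, 0.20, 0.35

net_h1 = w1 * i1 + w2 * i2 + b1   # 0.3775, as in the text
out_h1 = sigmoid(net_h1)          # the neuron's activated output
```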

The backward pass applies the chain rule step by step. For one weight: ∂E/∂w5 = ∂E/∂out × ∂out/∂net × ∂net/∂w5. Each term answers a specific question. How much does total error change when this output changes? How much does this output change when its net input changes? How much does net input change when this weight changes? Multiply those three numbers together, and you know exactly how to adjust that weight.
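The three factors, computed with values rounded from Mazur's example (out_o1 and out_h1 come from his forward pass; the error is E = ½(target − out)², so ∂E/∂out = out − target):

```python
# Forward-pass values, rounded from Mazur's worked example.
out_o1    = 0.7514   # the output neuron's activation
target_o1 = 0.01     # the desired output
out_h1    = 0.5933   # the hidden activation feeding weight w5

# The three chain-rule factors from the text:
dE_dout   = out_o1 - target_o1        # error change per unit of output
dout_dnet = out_o1 * (1 - out_o1)     # sigmoid derivative at this output
dnet_dw5  = out_h1                    # net-input change per unit of w5

dE_dw5 = dE_dout * dout_dnet * dnet_dw5   # the full gradient for w5
```

Multiplying the three local derivatives gives roughly 0.082, which (times a learning rate) is exactly how much w5 gets adjusted.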

The proof that it works: in Mazur's example, the total error drops from 0.298 to 0.291 after one iteration. After 10,000 iterations, it's 0.0000351.
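The same shrinking-error behavior is easy to reproduce on an even smaller toy: a single sigmoid neuron trained by the chain-rule update. This is a sketch, not Mazur's full 2-2-2 network:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# One neuron: predict target 0.9 from input 1.0 (arbitrary toy values).
x, target = 1.0, 0.9
w, b, lr = 0.5, 0.0, 0.5

def loss(w, b):
    out = sigmoid(w * x + b)
    return 0.5 * (target - out) ** 2

first = loss(w, b)
for _ in range(10_000):
    out   = sigmoid(w * x + b)
    delta = (out - target) * out * (1 - out)  # dE/dnet via the chain rule
    w -= lr * delta * x                       # dE/dw = delta * x
    b -= lr * delta                           # dE/db = delta
final = loss(w, b)
```

After 10,000 iterations the loss has dropped by many orders of magnitude, mirroring the trajectory in Mazur's example.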

When gradients misbehave

The algorithm sounds clean, but deep networks created real problems that took decades to solve.

Google's machine learning course identifies three failure modes. Vanishing gradients: the sigmoid's derivative never exceeds 0.25, so multiplying many such small numbers together produces tiny gradients, and lower layers stop learning because the signal has decayed to nearly zero. Exploding gradients are the opposite problem: if local derivatives are large, their product can blow up exponentially, causing training to diverge rather than converge. Dead ReLUs are a third issue: if a ReLU neuron's weighted-sum input goes negative for every training example, its gradient is zero everywhere, so it stops updating and never recovers.
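The arithmetic behind both failure modes fits in a few lines:

```python
# Sigmoid's derivative is at most 0.25, so a 20-layer chain multiplies
# twenty numbers no larger than 0.25 together.
grad = 1.0
for _ in range(20):
    grad *= 0.25          # best case for sigmoid at every layer
# grad is now 0.25**20, under a trillionth: the signal has vanished.

# The mirror image: local derivatives of 2.0 at every layer explode.
boom = 1.0
for _ in range(20):
    boom *= 2.0           # 2**20, over a million
```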

The fixes are architectural. ReLU activations help prevent vanishing gradients (their derivative is 1 for positive inputs, so gradients flow without shrinking). Batch normalization stabilizes training by controlling the scale of activations. Skip connections in ResNets let gradients flow directly from output to input, bypassing layers where they might decay. Much of the architectural progress since the 1986 paper (ReLU popularized in 2010, batch norm in 2015, transformers in 2017) has been about making gradient flow work better through deeper networks.
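A sketch of why two of these fixes help gradient flow. These are simplified scalar versions for illustration, not real layer implementations:

```python
def relu_grad(z):
    # ReLU's derivative: 1 for positive inputs, 0 otherwise, so active
    # units pass the upstream gradient through without shrinking it.
    return 1.0 if z > 0 else 0.0

def residual_backward(upstream, layer_grad):
    # Skip connection: output = x + layer(x), so by the chain rule the
    # gradient w.r.t. x is upstream * (1 + layer'(x)). The "1" term lets
    # the upstream gradient pass straight through even when layer_grad
    # has decayed to zero.
    return upstream * (1.0 + layer_grad)
```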

MIT's computer vision textbook offers a striking perspective: the backward pass is itself a neural network. It's made of linear operations (Jacobian products) that transform error signals just like the forward pass transforms data. In a sense, training a neural network means running two networks simultaneously: one for prediction, one for learning. This perspective also clarifies why you can use backpropagation for things beyond training weights. You can optimize inputs instead of parameters; this is how adversarial examples and feature visualizations work. Data and parameters are treated symmetrically by the chain rule.
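Optimizing an input instead of a weight uses identical machinery. A scalar sketch with a frozen one-parameter "network" (toy values, not a real adversarial-example setup):

```python
# Frozen "network": F(x) = (w*x - target)**2 with fixed weight w.
# Instead of adjusting w, we adjust the input x by its gradient --
# the same chain rule, applied to data rather than parameters.
w, target = 2.0, 10.0
x = 0.0
for _ in range(200):
    grad_x = 2 * (w * x - target) * w   # dF/dx by the chain rule
    x -= 0.01 * grad_x                  # gradient step on the input
# x converges toward target / w = 5.0, the input that minimizes F
```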

Strip away the jargon and you're left with a simple idea: neural networks are composed functions, the chain rule tells you how each parameter affects the output, and you can compute this efficiently by working backward. Not learning in any mystical sense. An algorithm for computing derivatives, plus an optimization method (gradient descent) that uses those derivatives to adjust parameters.

The chain rule is roughly 350 years old; Leibniz formalized it in 1676. What changed is that we now have the hardware to apply it trillions of times per second, across billions of parameters, on datasets large enough to find real patterns. The math hasn't changed since Leibniz's day. The scale has.

Frequently Asked Questions