One hidden layer with a non-linearity solves XOR — the wall the perceptron hit. To train the hidden weights their gradients are needed, and backpropagation is just the chain rule applied layer by layer, reusing intermediate results: \[\frac{\partial \mathcal{L}}{\partial \mathbf{W}^{(l)}} = \boldsymbol{\delta}^{(l)} \, (\mathbf{a}^{(l-1)})^\top.\] Rumelhart, Hinton & Williams (1986) made this practical — and it still trains every network in this book.
The perceptron failed at XOR because one linear boundary can’t separate the diagonals. A multilayer perceptron (MLP) — one hidden layer of neurons, each with a non-linearity — can: the hidden units carve the input space into pieces a final linear layer can combine. Representational power was never the hard part once you add a hidden layer.
The hard part was training it. A hidden weight affects the loss only indirectly, through everything downstream of it. How much should it change? The answer — backpropagation(Rumelhart et al., 1986) — is the chain rule organized so that each layer’s gradient is computed once and reused. It is the algorithm that turned multilayer networks from a known idea into a trainable reality.
6.2 The mechanism
6.2.1 The MLP
A one-hidden-layer network maps input \(\mathbf{x}\) to output \(\hat{\mathbf{y}}\):
The non-linearity \(\phi\) (from chapter 03) is what stops this collapsing to one linear map.
6.2.2 Backprop = chain rule, organized
Training minimizes a loss \(\mathcal{L}(\hat{\mathbf{y}}, \mathbf{y})\) (the cross-entropy from chapter 02) by gradient descent. The target is \(\partial\mathcal{L}/\partial\mathbf{W}^{(l)}\) for every layer. Define the error signal at layer \(l\):
Output error. For a sigmoid output with cross-entropy loss, the error simplifies beautifully: \(\boldsymbol{\delta}^{(\text{out})} = \hat{\mathbf{y}} - \mathbf{y}\).
Backward recursion. Error propagates backward through the weights and the activation’s derivative: \[
\boldsymbol{\delta}^{(l)} = \big((\mathbf{W}^{(l+1)})^\top \boldsymbol{\delta}^{(l+1)}\big) \odot \phi'(\mathbf{z}^{(l)}).
\]
Then each weight gradient is an outer product of the error with the layer’s input: \(\partial\mathcal{L}/\partial\mathbf{W}^{(l)} = \boldsymbol{\delta}^{(l)}(\mathbf{a}^{(l-1)})^\top\).
The efficiency comes from reuse: \(\boldsymbol{\delta}^{(l)}\) is computed once and feeds both this layer’s weight gradient and the next layer back. That reuse — dynamic programming over the computational graph — is why backprop costs about the same as the forward pass.
6.2.3 Minimal sketch — an MLP solving XOR
A 2-2-1 network with manual backprop. Watch it learn the function the perceptron couldn’t.
The backward recursion multiplies by \(\phi'(\mathbf{z})\). A hard step/sign function (the original perceptron) has zero derivative almost everywhere, so no gradient flows — which is exactly why the move to smooth activations (ch 03) was a prerequisite for training depth.
6.3 Application & impact
Backpropagation is not legacy — it is the training algorithm, unchanged in principle since 1986. What grew around it is automation and scale.
6.3.1 What this becomes
Concept here
What it became
Where you see it today
Manual chain-rule recursion
Reverse-mode autodiff
PyTorch autograd, JAX grad, TensorFlow — every framework
One hidden layer solves XOR
Universal approximation
the theoretical license for deep nets (upcoming sidecar)
\(\boldsymbol{\delta}^{(l)}\) recursion
The backward pass
loss.backward() in every training loop
Outer-product weight gradients
Batched matmul gradients
every GPU training kernel
6.3.2 Concretely, where the lineage lands
loss.backward() in PyTorch is this recursion, generalized to an arbitrary computational graph by reverse-mode autodiff (deferred to a later chapter).
Training a transformer is the same two-phase forward/backward loop — only the graph is bigger and the matmuls run on a GPU.
The XOR demo above is, structurally, a tiny language model’s training step: forward, cross-entropy, backward, update.
NoteKey takeaway
A hidden layer buys representational power; backpropagation makes that power trainable by computing every parameter’s gradient in one organized backward sweep. Everything downstream — CNNs, RNNs, transformers — is this same forward/backward loop at larger scale.
Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323(6088), 533–536. https://doi.org/10.1038/323533a0