6 MLP & Backpropagation

TL;DR

One hidden layer with a non-linearity solves XOR — the wall the perceptron hit. To train the hidden weights their gradients are needed, and backpropagation is just the chain rule applied layer by layer, reusing intermediate results: \[\frac{\partial \mathcal{L}}{\partial \mathbf{W}^{(l)}} = \boldsymbol{\delta}^{(l)} \, (\mathbf{a}^{(l-1)})^\top.\] Rumelhart, Hinton & Williams (1986) made this practical — and it still trains every network in this book.

Depends on: Activations & Non-linearity

6.1 Why this matters

The perceptron failed at XOR because one linear boundary can’t separate the diagonals. A multilayer perceptron (MLP) — one hidden layer of neurons, each with a non-linearity — can: the hidden units carve the input space into pieces a final linear layer can combine. Representational power was never the hard part once you add a hidden layer.

The hard part was training it. A hidden weight affects the loss only indirectly, through everything downstream of it. How much should it change? The answer — backpropagation (Rumelhart et al., 1986) — is the chain rule organized so that each layer’s gradient is computed once and reused. It is the algorithm that turned multilayer networks from a known idea into a trainable reality.

6.2 The mechanism

6.2.1 The MLP

A one-hidden-layer network maps input \(\mathbf{x}\) to output \(\hat{\mathbf{y}}\):

\[ \mathbf{z}^{(1)} = \mathbf{W}^{(1)}\mathbf{x} + \mathbf{b}^{(1)}, \quad \mathbf{a}^{(1)} = \phi(\mathbf{z}^{(1)}), \quad \hat{\mathbf{y}} = \phi\!\big(\mathbf{W}^{(2)}\mathbf{a}^{(1)} + \mathbf{b}^{(2)}\big). \]

The non-linearity \(\phi\) (from chapter 03) is what stops this collapsing to one linear map.

6.2.2 Backprop = chain rule, organized

Training minimizes a loss \(\mathcal{L}(\hat{\mathbf{y}}, \mathbf{y})\) (the cross-entropy from chapter 02) by gradient descent. The target is \(\partial\mathcal{L}/\partial\mathbf{W}^{(l)}\) for every layer. Define the error signal at layer \(l\):

\[ \boldsymbol{\delta}^{(l)} = \frac{\partial \mathcal{L}}{\partial \mathbf{z}^{(l)}}. \]

Two facts give the whole algorithm:

Output error. For a sigmoid output with cross-entropy loss, the error simplifies beautifully: \(\boldsymbol{\delta}^{(\text{out})} = \hat{\mathbf{y}} - \mathbf{y}\).
Backward recursion. Error propagates backward through the weights and the activation’s derivative: \[ \boldsymbol{\delta}^{(l)} = \big((\mathbf{W}^{(l+1)})^\top \boldsymbol{\delta}^{(l+1)}\big) \odot \phi'(\mathbf{z}^{(l)}). \]

Then each weight gradient is an outer product of the error with the layer’s input: \(\partial\mathcal{L}/\partial\mathbf{W}^{(l)} = \boldsymbol{\delta}^{(l)}(\mathbf{a}^{(l-1)})^\top\).

The efficiency comes from reuse: \(\boldsymbol{\delta}^{(l)}\) is computed once and feeds both this layer’s weight gradient and the next layer back. That reuse — dynamic programming over the computational graph — is why backprop costs about the same as the forward pass.

6.2.3 Minimal sketch: an MLP solving XOR

A 2-2-1 network with manual backprop. Watch it learn the function the perceptron couldn’t.

import numpy as np
rng = np.random.default_rng(1)

X = np.array([[0,0],[0,1],[1,0],[1,1]], dtype=float)
y = np.array([[0],[1],[1],[0]], dtype=float)        # XOR

sigmoid = lambda z: 1 / (1 + np.exp(-z))

W1 = rng.normal(size=(2, 2)); b1 = np.zeros((1, 2))
W2 = rng.normal(size=(2, 1)); b2 = np.zeros((1, 1))
lr = 0.5

for epoch in range(8000):
    a1   = sigmoid(X @ W1 + b1)            # hidden activations  (4,2)
    yhat = sigmoid(a1 @ W2 + b2)           # output              (4,1)

    d2 = (yhat - y) / len(X)               # output error  (BCE + sigmoid)
    dW2 = a1.T @ d2;  db2 = d2.sum(0, keepdims=True)
    d1 = (d2 @ W2.T) * a1 * (1 - a1)       # backprop through hidden
    dW1 = X.T @ d1;   db1 = d1.sum(0, keepdims=True)

    W2 -= lr*dW2; b2 -= lr*db2
    W1 -= lr*dW1; b1 -= lr*db1

pred = sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2)
print("predictions:", pred.round(2).ravel())
print("targets:    ", y.ravel())

predictions: [0. 1. 1. 0.]
targets:     [0. 1. 1. 0.]

The predictions snap to the XOR targets — one hidden layer plus backprop clears the wall from chapter 01.

Going deeper

Numerical walkthrough — backprop by hand on a 2-2-1 net — every partial derivative of one forward+backward pass, with explicit numbers.

Pitfall: backprop needs differentiable activations

The backward recursion multiplies by \(\phi'(\mathbf{z})\). A hard step/sign function (the original perceptron) has zero derivative almost everywhere, so no gradient flows — which is exactly why the move to smooth activations (ch 03) was a prerequisite for training depth.

6.3 Application & impact

Backpropagation is not legacy — it is the training algorithm, unchanged in principle since 1986. What grew around it is automation and scale.

6.3.1 What this becomes

Concept here	What it became	Where you see it today
Manual chain-rule recursion	Reverse-mode autodiff	PyTorch `autograd`, JAX `grad`, TensorFlow — every framework
One hidden layer solves XOR	Universal approximation	the theoretical license for deep nets (upcoming sidecar); raises the question of capacity — see Prerequisites
\(\boldsymbol{\delta}^{(l)}\) recursion	The backward pass	`loss.backward()` in every training loop
Outer-product weight gradients	Batched matmul gradients	every GPU training kernel

6.3.2 Concretely, where the lineage lands

loss.backward() in PyTorch is this recursion, generalized to an arbitrary computational graph by reverse-mode autodiff (deferred to a later chapter).
Training a transformer is the same two-phase forward/backward loop — only the graph is bigger and the matmuls run on a GPU.
The XOR demo above is, structurally, a tiny language model’s training step: forward, cross-entropy, backward, update.

Key takeaway

A hidden layer buys representational power; backpropagation makes that power trainable by computing every parameter’s gradient in one organized backward sweep. Everything downstream — CNNs, RNNs, transformers — is this same forward/backward loop at larger scale.

Parallel track: Convolutional Networks

While MLP + backprop was being explored for sequences, LeCun et al. (1989) applied the same forward/backward loop to convolutional layers — weight sharing across spatial positions — producing LeNet, the first practical image classifier. The convolution idea (local receptive fields, shared weights, pooling) forms a parallel architectural lineage that rejoins the sequence-modeling timeline via ResNet’s residual connections (2015) and ConvSeq2Seq (2017). Vision CNNs are outside the scope of this book; the concepts resurface where they directly influence sequence architectures.

→ Next: Gradient Flow & Vanishing Gradients