One forward pass and one backward pass through a 2-2-1 sigmoid network, with explicit numbers at every step. The goal is to demystify \(\partial\mathcal{L}/\partial\mathbf{W}\) — it is arithmetic, not magic.
20.1 Setup
The network has 2 inputs, 2 hidden units, and 1 output, all with sigmoid activations. The weights are fixed and there is a single training example, so every number is concrete.
Biases are zero and the loss is binary cross-entropy. The walkthrough follows the code rather than transcribing every multiplication; the point is that each step is a small, checkable number.
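As a minimal sketch of the forward pass through such a 2-2-1 network, the snippet below uses illustrative values for `x`, `W1`, `W2`, and `y`. These are assumptions, not necessarily the values of the original setup; they were picked so that the output error comes out consistent with the `d2` printed later in this walkthrough.

```python
import numpy as np

def sigmoid(z):
    """Elementwise logistic function."""
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative values (assumptions; the original setup values may differ).
x = np.array([1.0, 0.5])            # single training input
W1 = np.array([[0.2, -0.2],
               [0.1,  0.4]])        # hidden-layer weights, shape (2, 2)
W2 = np.array([0.5, 0.6])           # output-layer weights, shape (2,)
y = 1.0                             # target label

# Forward pass with zero biases.
z1 = W1 @ x                         # hidden pre-activations
a1 = sigmoid(z1)                    # hidden activations
z2 = W2 @ a1                        # output pre-activation (scalar)
yhat = sigmoid(z2)                  # predicted probability

# Binary cross-entropy for one example.
loss = -(y * np.log(yhat) + (1 - y) * np.log(1 - yhat))

print("a1   =", a1.round(4))
print("yhat =", round(float(yhat), 4))
print("loss =", round(float(loss), 4))
```

Every intermediate (`z1`, `a1`, `z2`, `yhat`) is kept in a variable because the backward pass reuses them.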
Apply the two backprop facts from the parent note.
Output error (sigmoid + BCE cancels to \(\hat{y} - y\)):
d2 = yhat - y    # scalar error at the output pre-activation
dW2 = d2 * a1    # gradient for W2 (outer product, here a vector)
print("d2 =", round(d2, 4))
print("dW2 =", dW2.round(4))
d2 = -0.3527
dW2 = [-0.1852 -0.2026]
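The cancellation behind \(\hat{y} - y\) can be checked in two lines. The sketch below assumes the usual definitions, which are not spelled out in this section: \(\hat{y} = \sigma(z^{(2)})\) and \(\mathcal{L} = -\bigl[y\log\hat{y} + (1-y)\log(1-\hat{y})\bigr]\).

\[
\frac{\partial\mathcal{L}}{\partial\hat{y}} = -\frac{y}{\hat{y}} + \frac{1-y}{1-\hat{y}} = \frac{\hat{y}-y}{\hat{y}(1-\hat{y})},
\qquad
\frac{\partial\hat{y}}{\partial z^{(2)}} = \hat{y}(1-\hat{y})
\]

\[
\delta^{(2)} = \frac{\partial\mathcal{L}}{\partial\hat{y}}\cdot\frac{\partial\hat{y}}{\partial z^{(2)}} = \hat{y} - y
\]

The factor \(\hat{y}(1-\hat{y})\) from the sigmoid derivative cancels the denominator from the loss derivative, which is why the output error needs no extra multiplication.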
Backprop into the hidden layer (\(\boldsymbol{\delta}^{(1)} = (\mathbf{W}^{(2)\top}\delta^{(2)}) \odot a_1(1-a_1)\)):
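The hidden-layer step can be sketched as follows. The values of `x`, `W1`, `W2`, and `y` are illustrative assumptions (the original setup values may differ), chosen so that `d2` is consistent with the value printed above; the forward pass is repeated so the snippet runs on its own.

```python
import numpy as np

def sigmoid(z):
    """Elementwise logistic function."""
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative values (assumptions; the original setup values may differ).
x = np.array([1.0, 0.5])
W1 = np.array([[0.2, -0.2],
               [0.1,  0.4]])
W2 = np.array([0.5, 0.6])
y = 1.0

# Forward pass (zero biases).
a1 = sigmoid(W1 @ x)
yhat = sigmoid(W2 @ a1)

# Output error: sigmoid + BCE reduces to yhat - y.
d2 = yhat - y

# Hidden error: (W2^T * d2) elementwise-times the sigmoid derivative a1 * (1 - a1).
# W2 is a length-2 vector here, so W2^T * d2 is simply W2 * d2.
d1 = (W2 * d2) * a1 * (1 - a1)

# Weight gradient for the hidden layer: outer product of error and input.
dW1 = np.outer(d1, x)

print("d1  =", d1.round(4))
print("dW1 =\n", dW1.round(4))
print("shapes:", dW1.shape, W1.shape)
```

Note that `dW1` has the same shape as `W1`: row `i` is the hidden error `d1[i]` scaled by the input vector `x`.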
The analytic and numerical gradients agree — confirming the chain-rule bookkeeping is correct.
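A gradient check of this kind can be sketched with central finite differences. The helper names `loss_fn` and `numerical_grad`, the step size `eps`, and the values of `x`, `W1`, `W2`, and `y` are all assumptions made for illustration, not the original code.

```python
import numpy as np

def sigmoid(z):
    """Elementwise logistic function."""
    return 1.0 / (1.0 + np.exp(-z))

def loss_fn(W1, W2, x, y):
    """Binary cross-entropy of the 2-2-1 network with zero biases."""
    a1 = sigmoid(W1 @ x)
    yhat = sigmoid(W2 @ a1)
    return -(y * np.log(yhat) + (1 - y) * np.log(1 - yhat))

def numerical_grad(f, W, eps=1e-6):
    """Central-difference gradient of f() with respect to W, perturbing W in place."""
    grad = np.zeros_like(W)
    for idx in np.ndindex(W.shape):
        old = W[idx]
        W[idx] = old + eps
        f_plus = f()
        W[idx] = old - eps
        f_minus = f()
        W[idx] = old                      # restore the original entry
        grad[idx] = (f_plus - f_minus) / (2 * eps)
    return grad

# Illustrative values (assumptions; the original setup values may differ).
x = np.array([1.0, 0.5])
W1 = np.array([[0.2, -0.2],
               [0.1,  0.4]])
W2 = np.array([0.5, 0.6])
y = 1.0

# Analytic gradients from backprop.
a1 = sigmoid(W1 @ x)
yhat = sigmoid(W2 @ a1)
d2 = yhat - y
dW2 = d2 * a1
d1 = (W2 * d2) * a1 * (1 - a1)
dW1 = np.outer(d1, x)

# Numerical gradients from finite differences.
f = lambda: loss_fn(W1, W2, x, y)
num_dW1 = numerical_grad(f, W1)
num_dW2 = numerical_grad(f, W2)

print("max |dW1 - num_dW1| =", np.abs(dW1 - num_dW1).max())
print("max |dW2 - num_dW2| =", np.abs(dW2 - num_dW2).max())
```

Central differences are used rather than one-sided differences because their truncation error shrinks with the square of `eps`, so agreement to many decimal places is a meaningful signal.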
20.5 What this walkthrough teaches
Every gradient is a product of local terms already computed in the forward pass (\(a_1\), \(\hat{y}\)) — backprop reuses them rather than recomputing.
The outer-product shape \(\boldsymbol{\delta}\,\mathbf{a}^\top\) is why weight gradients have the same shape as the weights.
Gradient checking (analytic vs finite-difference) is the standard way to catch a backprop bug — a habit worth keeping when implementing layers by hand later in the timeline.