3 The Artificial Neuron: From McCulloch-Pitts to the Perceptron

TL;DR

An artificial neuron is a weighted sum followed by a non-linearity: \(\hat{y} = \phi(\mathbf{w}^\top\mathbf{x} + b)\). McCulloch & Pitts (1943) fixed the structure, with no learning. Rosenblatt (1958) added the learning rule, which comes with a convergence guarantee on linearly separable data (Novikoff, 1962). Together they are the seed every later architecture grows from.

Depends on: Before the Perceptron — Prehistory

3.1 Why this matters

Two ideas had to fuse before anything “neural” could exist as a learning machine: what a neuron is (a thresholded weighted sum) and how it changes when shown labelled examples (a parameter update rule). McCulloch & Pitts (1943) gave the first; Rosenblatt (1958) gave the second. Every architecture in this timeline composes, smooths, or scales this same skeleton. The perceptron is not just history; it is the unit cell.

It also matters for what it cannot do. Minsky & Papert (1969) proved that a single-layer perceptron cannot represent XOR: its two classes are not linearly separable, so no single linear boundary splits them. That single negative result motivates the multi-layer perceptron, and through it, every deep network.

The setup: supervised learning

Everything in this chapter (and most of the book) lives in the supervised learning paradigm. The setup is a dataset of labelled examples \(\{(\mathbf{x}_i, y_i)\}\) — each input \(\mathbf{x}_i\) (the features) paired with a target \(y_i\) (the label). The model is a parametric function \(f_\theta\); learning means adjusting parameters \(\theta\) so \(f_\theta(\mathbf{x})\) matches \(y\) on examples it has seen (training) and, crucially, on examples it has not (generalization, measured on a held-out test set). The perceptron is the simplest instance: \(\theta = (\mathbf{w}, b)\), and “matches” means the predicted sign equals the label.

3.2 The mechanism

3.2.1 Step 1: McCulloch-Pitts neuron: structure, no learning

A McCulloch-Pitts (MP) unit takes binary inputs \(x_i \in \{0,1\}\), weights them (originally with \(\pm 1\)), sums, and thresholds:

\[ y = \mathbb{1}\!\left[\sum_i w_i x_i \ge \theta\right] \]

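Here \(\theta\) is the firing threshold, not the parameter vector \(\theta\) of the supervised-learning setup. The weights and threshold are fixed by the designer. Networks of such units can implement any propositional logic function, but there is no mechanism for the network to change itself from data.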
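As a minimal sketch of what "fixed by the designer" means, the units below are wired into logic gates by hand. The helper name `mp_unit` and the particular weights and thresholds are illustrative assumptions, not values taken from the 1943 paper; the \(-1\) weight in `NOT` stands in for an inhibitory input.

```python
import numpy as np

def mp_unit(x, w, theta):
    """McCulloch-Pitts unit: output 1 when the weighted sum reaches theta, else 0."""
    return int(np.dot(w, x) >= theta)

# Hand-picked weights and thresholds (illustrative; nothing here is learned).
def AND(a, b):
    return mp_unit([a, b], [1, 1], theta=2)

def OR(a, b):
    return mp_unit([a, b], [1, 1], theta=1)

def NOT(a):
    return mp_unit([a], [-1], theta=0)

# Composing units yields functions a single unit cannot compute, such as XOR.
def XOR(a, b):
    return AND(OR(a, b), NOT(AND(a, b)))

for a in (0, 1):
    for b in (0, 1):
        print(a, b, "AND:", AND(a, b), "OR:", OR(a, b), "XOR:", XOR(a, b))
```

Every number in this block was chosen by hand. Changing what the network computes means editing those constants, which is the gap the perceptron's update rule fills.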

Pitfall: MP is not a learning machine

MP neurons are often described as “the first neural network”, which is true structurally but misleading practically. They cannot learn; their weights are programmed, not trained. The leap to learning is exactly what the perceptron adds.

3.2.2 Step 2: Perceptron: same structure + a learning rule

Rosenblatt’s perceptron generalises MP to real-valued inputs and weights, swaps the indicator for a sign function, and crucially introduces an update rule:

\[ \hat{y} = \mathrm{sign}(\mathbf{w}^\top \mathbf{x} + b) \quad\in\{-1,+1\} \]

Geometrically \(\mathbf{w}^\top\mathbf{x}+b=0\) is a hyperplane in \(\mathbb{R}^d\); the perceptron classifies by which side a point falls on. Given a labelled example \((\mathbf{x}_i, y_i)\) with \(y_i \in \{-1,+1\}\), the perceptron updates only when it makes a mistake:

\[ \text{if } y_i(\mathbf{w}^\top\mathbf{x}_i + b) \le 0: \quad \mathbf{w} \leftarrow \mathbf{w} + \eta\, y_i \mathbf{x}_i, \quad b \leftarrow b + \eta\, y_i \]

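Intuition: a misclassified positive example pulls \(\mathbf{w}\) toward \(\mathbf{x}_i\); a misclassified negative example pushes \(\mathbf{w}\) away from it. The bias moves in the same direction, so the boundary rotates and shifts until it separates the data, if separation is possible.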
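A short check of that intuition: substituting the update into the signed score gives \(y_i(\mathbf{w}'^\top\mathbf{x}_i + b') = y_i(\mathbf{w}^\top\mathbf{x}_i + b) + \eta(\|\mathbf{x}_i\|^2 + 1)\), because \(y_i^2 = 1\). Each update therefore raises the score of the offending example by a fixed positive amount, though a single step need not be enough to fix it. The sketch below verifies this for one step; the starting weights and the example are arbitrary illustrative values.

```python
import numpy as np

# Arbitrary illustrative starting point and one positive example.
w, b, eta = np.array([1.0, -2.0]), 0.5, 1.0
x, y = np.array([1.0, 1.0]), +1

before = y * (w @ x + b)          # signed score; <= 0 means a mistake

# One perceptron update on the misclassified example.
w_new = w + eta * y * x
b_new = b + eta * y
after = y * (w_new @ x + b_new)

# Predicted gain from the algebra above: eta * (||x||^2 + 1).
gain = eta * (x @ x + 1.0)

print("score before:", before)
print("score after: ", after)
print("gain:", after - before, " predicted:", gain)
```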

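Convergence (Novikoff, 1962). If the data is linearly separable with margin \(\gamma>0\) (measured against a unit-norm separating vector) and \(\|\mathbf{x}_i\|\le R\), the perceptron makes at most \((R/\gamma)^2\) mistakes before converging, independent of the dimension \(d\). The bound is stated for the bias-free form; the bias fits in by appending a constant 1 to every input.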
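The bound can be checked numerically on a small separable problem. The sketch below uses the AND data, absorbs the bias by appending a constant 1 to each input, counts every update until a clean pass, and compares the total with \((R/\gamma)^2\). The separator `u` is a hand-picked assumption; any unit-norm vector that separates the data gives a valid, though possibly loose, bound.

```python
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([-1, -1, -1, +1])                  # AND with labels in {-1, +1}

# Absorb the bias: (w, b) becomes a single vector v acting on [x, 1].
Xa = np.hstack([X, np.ones((len(X), 1))])

# Run the perceptron rule until one full pass makes no mistakes.
# The loop terminates because AND is linearly separable.
v = np.zeros(Xa.shape[1])
total = 0
while True:
    m = 0
    for xi, yi in zip(Xa, y):
        if yi * (v @ xi) <= 0:
            v += yi * xi
            m += 1
    total += m
    if m == 0:
        break

# Hand-picked separator (an assumption), normalised to unit length.
u = np.array([2.0, 2.0, -3.0])
u /= np.linalg.norm(u)
gamma = np.min(y * (Xa @ u))                    # margin achieved by u
R = np.max(np.linalg.norm(Xa, axis=1))          # largest input norm
bound = (R / gamma) ** 2

print("total mistakes:", total)
print("bound (R/gamma)^2:", bound)
```

On this data the loop makes 18 mistakes, against a bound of about 51 for this particular `u`. A separator with a larger margin would tighten the bound.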

3.2.3 Minimal sketch

From-scratch NumPy — no nn_timeline imports yet. Train on AND (linearly separable) then XOR (not) to make the limitation visible.

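import numpy as np

def perceptron_fit(X, y, lr=1.0, epochs=20):
    """Train a perceptron; y must hold labels in {-1, +1}."""
    w = np.zeros(X.shape[1])
    b = 0.0
    mistakes = []  # number of updates made in each epoch
    for _ in range(epochs):
        m = 0
        for xi, yi in zip(X, y):
            # Mistake, or a point exactly on the boundary: move w and b toward yi.
            if yi * (w @ xi + b) <= 0:
                w += lr * yi * xi
                b += lr * yi
                m += 1
        mistakes.append(m)
    return w, b, mistakes

def predict(X, w, b):
    # np.sign returns 0 for a score of exactly 0, so outputs lie in {-1, 0, +1}.
    return np.sign(X @ w + b)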

AND — linearly separable:

X = np.array([[0,0],[0,1],[1,0],[1,1]], dtype=float)
y_and = np.array([-1, -1, -1, +1])
w, b, hist = perceptron_fit(X, y_and)
print("weights:", w, " bias:", b)
print("mistakes per epoch:", hist)
print("predictions:", predict(X, w, b))

weights: [3. 2.]  bias: -4.0
mistakes per epoch: [2, 3, 3, 2, 2, 3, 2, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
predictions: [-1. -1. -1.  1.]

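Mistakes drop to zero in the ninth epoch. The learned boundary \(3x_1 + 2x_2 = 4\) puts \((1,1)\) on the positive side and the other three corners on the negative side.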

XOR — not linearly separable:

y_xor = np.array([-1, +1, +1, -1])
w, b, hist = perceptron_fit(X, y_xor, epochs=50)
print("last 10 epoch mistake counts:", hist[-10:])
print("predictions:", predict(X, w, b), "  true:", y_xor)

last 10 epoch mistake counts: [4, 4, 4, 4, 4, 4, 4, 4, 4, 4]
predictions: [0. 0. 0. 0.]   true: [-1  1  1 -1]

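Mistakes never reach zero: every epoch misclassifies all four points and ends back at \(\mathbf{w} = \mathbf{0}\), \(b = 0\), which is why the predictions print as 0 (the value of np.sign(0)) rather than \(\pm 1\). No single line separates the two diagonals of the unit square. This is the wall.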
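The impossibility can also be shown by hand in a few lines. With the \(\pm 1\) labels used in the code, a correct classifier \(\mathrm{sign}(w_1 x_1 + w_2 x_2 + b)\) would have to satisfy four strict inequalities, one per corner:

\[ \begin{aligned} (0,0) \mapsto -1 &: \quad b < 0 \\ (0,1) \mapsto +1 &: \quad w_2 + b > 0 \\ (1,0) \mapsto +1 &: \quad w_1 + b > 0 \\ (1,1) \mapsto -1 &: \quad w_1 + w_2 + b < 0 \end{aligned} \]

Adding the two middle lines gives \(w_1 + w_2 + 2b > 0\), so \(w_1 + w_2 + b > -b > 0\) by the first line. That contradicts the last line, so no choice of \((w_1, w_2, b)\) works, however long the perceptron trains.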

3.2.4 What to observe

For AND the perceptron converges: the mistake count reaches zero in the ninth epoch and stays there.
For XOR the mistake count stays at 4 in every epoch and the weights cycle back to zero. The update rule is working; the hypothesis class is the problem.
The boundary is linear in input space. Curved boundaries require either changed features (kernels) or composed neurons with a non-linearity in between (the MLP, next chapter).
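The "changed features" route can be sketched without any new machinery. The block below appends the product \(x_1 x_2\) as a third input and trains the same perceptron on XOR. The choice of feature map is an assumption made for illustration (it is one of many that work), and `perceptron_fit` is repeated from the minimal sketch so the block runs on its own.

```python
import numpy as np

def perceptron_fit(X, y, lr=1.0, epochs=50):
    # Same mistake-driven rule as the minimal sketch, repeated for self-containment.
    w, b, mistakes = np.zeros(X.shape[1]), 0.0, []
    for _ in range(epochs):
        m = 0
        for xi, yi in zip(X, y):
            if yi * (w @ xi + b) <= 0:
                w += lr * yi * xi
                b += lr * yi
                m += 1
        mistakes.append(m)
    return w, b, mistakes

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y_xor = np.array([-1, +1, +1, -1])

# Assumed feature map: (x1, x2) -> (x1, x2, x1 * x2).
X_feat = np.column_stack([X, X[:, 0] * X[:, 1]])

w, b, hist = perceptron_fit(X_feat, y_xor)
pred = np.sign(X_feat @ w + b)

print("mistakes per epoch (first 15):", hist[:15])
print("predictions:", pred, "  true:", y_xor)
```

In the three-dimensional feature space XOR is linearly separable, so the mistake count reaches zero well within the 50 epochs. The model is still linear in its inputs; only the inputs changed. The MLP takes the other route and learns such intermediate features itself.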

Going deeper

Numerical walkthrough — perceptron updates on AND and XOR — every step of the update rule by hand, with a figure showing the boundary rotating across epochs.

3.3 Application & impact

The artificial neuron is legacy as a standalone learning machine — no one ships a single perceptron in 2026. But its DNA is in every modern model. Three threads survived; the XOR limitation forced the fourth.

3.3.1 What survived, what changed

Original piece	What it became	Where you see it today
Threshold \(\mathrm{sign}(\mathbf{w}^\top\mathbf{x}+b)\)	Smooth activation \(\phi\)	ReLU, GELU, SiLU in every transformer block
Mistake-driven update	Stochastic gradient descent	AdamW training a 70B-parameter LLM
Convergence bound \((R/\gamma)^2\)	Margin-based learning theory	SVM (Cortes & Vapnik, 1995), statistical learning theory
Single-layer limit (XOR)	Motivation for depth	Every multi-layer network, trained with backpropagation as popularized by Rumelhart et al. (1986)

3.3.2 Concretely, where the lineage lands

Logistic regression = perceptron with \(\sigma\) instead of \(\mathrm{sign}\) and cross-entropy instead of the mistake rule. Still the linear baseline for tabular ML in 2026.
A single neuron inside a transformer is a perceptron with a smooth activation glued on: \(\mathrm{GELU}(\mathbf{w}^\top\mathbf{x}+b)\). Stack enough of them and the weights add up to the \(\sim 10^{11}\) parameters of a large LLM.
The optimizer that trains GPT-class models is a momentum-aware, adaptive-LR version of the perceptron update applied to a differentiable loss instead of a 0/1 mistake indicator.
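The link between the perceptron rule and gradient descent can be made explicit for the logistic case. For labels in \(\{-1,+1\}\) and score \(s = \mathbf{w}^\top\mathbf{x} + b\), the logistic loss \(\log(1 + e^{-ys})\) (cross-entropy rewritten for \(\pm 1\) labels) has the SGD step \(\mathbf{w} \leftarrow \mathbf{w} + \eta\,\sigma(-ys)\,y\,\mathbf{x}\). That is the perceptron update with the hard gate \(\mathbb{1}[ys \le 0]\) replaced by a soft one in \((0,1)\). The sketch below compares the two on a single example; the function names and numbers are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def perceptron_step(w, b, x, y, lr=1.0):
    """Hard gate: move by lr * y * x only when the example is misclassified."""
    gate = 1.0 if y * (w @ x + b) <= 0 else 0.0
    return w + lr * gate * y * x, b + lr * gate * y

def logistic_step(w, b, x, y, lr=1.0):
    """One SGD step on log(1 + exp(-y * s)): same direction, soft gate in (0, 1)."""
    gate = sigmoid(-y * (w @ x + b))
    return w + lr * gate * y * x, b + lr * gate * y

# Illustrative values: the score of x is 6, far from the boundary.
w, b = np.array([1.0, 1.0]), 0.0
x = np.array([3.0, 3.0])

# Badly misclassified (true label -1): both rules take nearly the same step.
hard_w, hard_b = perceptron_step(w, b, x, -1)
soft_w, soft_b = logistic_step(w, b, x, -1)
print("misclassified:   hard", hard_w, hard_b, " soft", soft_w, soft_b)

# Confidently correct (true label +1): the hard rule does nothing, the soft rule almost nothing.
keep_w, keep_b = perceptron_step(w, b, x, +1)
tiny_w, tiny_b = logistic_step(w, b, x, +1)
print("well classified: hard", keep_w, keep_b, " soft", tiny_w, tiny_b)
```

The two rules differ most near the boundary, where the soft gate is close to 0.5. Because the soft gate is differentiable, the same idea extends to stacked layers and to optimizers with momentum and adaptive learning rates.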

Key takeaway

The artificial neuron pins down the three pieces every later architecture keeps: a parametric prediction function, a parameter-update rule, and a convergence story. The XOR wall is what forces depth and non-linearity into the picture — resolved in the chapters ahead, once the learning signal (what to optimize) and activations (the non-linearity) are established.

→ Next: Information Theory