3  The Artificial Neuron — From McCulloch-Pitts to the Perceptron

Tip. TL;DR

An artificial neuron is a weighted sum followed by a non-linearity: \(\hat{y} = \phi(\mathbf{w}^\top\mathbf{x} + b)\). McCulloch & Pitts (1943) fixed the structure, with no learning. Rosenblatt (1958) added the learning rule, and Novikoff (1962) supplied the convergence bound used in this chapter. Together they are the seed every later architecture grows from.

Depends on: Before the Perceptron — Prehistory

3.1 Why this matters

Two ideas had to fuse before anything “neural” could exist as a learning machine: what a neuron is (a thresholded weighted sum) and how it changes when shown labelled examples (a parameter update rule). McCulloch & Pitts (1943) gave the first; Rosenblatt (1958) gave the second. Every architecture in this timeline composes, smooths, or scales this same skeleton: the perceptron is not just history, it is the unit cell.

It also matters for what it cannot do. Minsky & Papert (1969) proved that a single-layer perceptron cannot represent XOR: no single linear boundary separates its two classes. That negative result motivates the multi-layer perceptron and, through it, every deep network.

Note. The setup: supervised learning

Everything in this chapter (and most of the book) lives in the supervised learning paradigm. The setup is a dataset of labelled examples \(\{(\mathbf{x}_i, y_i)\}\) — each input \(\mathbf{x}_i\) (the features) paired with a target \(y_i\) (the label). The model is a parametric function \(f_\theta\); learning means adjusting parameters \(\theta\) so \(f_\theta(\mathbf{x})\) matches \(y\) on examples it has seen (training) and, crucially, on examples it has not (generalization, measured on a held-out test set). The perceptron is the simplest instance: \(\theta = (\mathbf{w}, b)\), and “matches” means the predicted sign equals the label.

3.2 The mechanism

3.2.1 Step 1 — McCulloch-Pitts neuron: structure, no learning

A McCulloch-Pitts (MP) unit takes binary inputs \(x_i \in \{0,1\}\), weights them (originally with \(\pm 1\)), sums, and thresholds:

\[ y = \mathbb{1}\!\left[\sum_i w_i x_i \ge \theta\right] \]

The weights and threshold are fixed by the designer. Networks of such units can implement any propositional logic function — but there is no mechanism for the network to change itself from data.
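As a minimal sketch of the fixed-weight idea, the snippet below hand-wires MP units for AND, OR and NOT. The helper name `mp_unit` and the particular weights and thresholds are illustrative assumptions, not taken from the 1943 paper; NOT uses a weight of -1 with threshold 0 as a simple stand-in for inhibition.

```python
import numpy as np

def mp_unit(x, w, theta):
    """Hypothetical helper: output 1 iff the weighted sum reaches the threshold."""
    return int(np.dot(w, x) >= theta)

# Hand-picked weights and thresholds: nothing here is learned from data.
def AND(a, b):
    return mp_unit([a, b], [1, 1], theta=2)

def OR(a, b):
    return mp_unit([a, b], [1, 1], theta=1)

def NOT(a):
    return mp_unit([a], [-1], theta=0)

# Print the truth table as (a, b, AND, OR).
for a in (0, 1):
    for b in (0, 1):
        print(a, b, AND(a, b), OR(a, b))
```

Changing the behaviour of any of these units means editing the numbers by hand, which is exactly the gap a learning rule fills.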

Warning. Pitfall: MP is not a learning machine

MP neurons are often described as “the first neural network”, which is true structurally but misleading practically. They cannot learn; their weights are programmed, not trained. The leap to learning is exactly what the perceptron adds.

3.2.2 Step 2 — Perceptron: same structure + a learning rule

Rosenblatt’s perceptron generalises MP to real-valued inputs and weights, swaps the indicator for a sign function, and crucially introduces an update rule:

\[ \hat{y} = \mathrm{sign}(\mathbf{w}^\top \mathbf{x} + b) \quad\in\{-1,+1\} \]

Geometrically \(\mathbf{w}^\top\mathbf{x}+b=0\) is a hyperplane in \(\mathbb{R}^d\); the perceptron classifies by which side a point falls on. Given a labelled example \((\mathbf{x}_i, y_i)\) with \(y_i \in \{-1,+1\}\), the perceptron updates only when it makes a mistake:

\[ \text{if } y_i(\mathbf{w}^\top\mathbf{x}_i + b) \le 0: \quad \mathbf{w} \leftarrow \mathbf{w} + \eta\, y_i \mathbf{x}_i, \quad b \leftarrow b + \eta\, y_i \]

Intuition: a misclassified positive pushes \(\mathbf{w}\) toward it; a misclassified negative pushes it away. The boundary rotates until it separates the data — if separation is possible.
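One way to check this intuition is to look at what a single update does to the quantity the mistake test inspects. Writing \(\mathbf{w}'\) and \(b'\) for the updated parameters (notation introduced here for illustration) and using \(y_i^2 = 1\):

\[ y_i(\mathbf{w}'^\top\mathbf{x}_i + b') = y_i\big((\mathbf{w} + \eta\, y_i \mathbf{x}_i)^\top\mathbf{x}_i + b + \eta\, y_i\big) = y_i(\mathbf{w}^\top\mathbf{x}_i + b) + \eta\,(\|\mathbf{x}_i\|^2 + 1) \]

Each update therefore raises the signed score of the offending example by \(\eta(\|\mathbf{x}_i\|^2 + 1) > 0\). A single step may not be enough to fix that example, and it can disturb others, so this identity alone does not guarantee convergence.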

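Convergence (Novikoff, 1962). Fold the bias into \(\mathbf{w}\) by appending a constant 1 to each input. If some unit-norm separator classifies every example with margin at least \(\gamma>0\), and \(\|\mathbf{x}_i\|\le R\) for all \(i\), the perceptron makes at most \((R/\gamma)^2\) mistakes before converging. The bound depends on the geometry of the data, not on the dimension \(d\) or the number of examples.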
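To make the bound concrete, here is a small sketch that evaluates it on the AND data. It assumes the bias is folded into the weights by appending a constant 1 to each input, and it uses a hand-picked separator `u` (an assumption, not necessarily the maximum-margin one), so the resulting bound is valid but may be loose.

```python
import numpy as np

# AND data with labels in {-1, +1}; append a constant 1 so the bias is part of w.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([-1, -1, -1, 1])
X_aug = np.hstack([X, np.ones((len(X), 1))])

# Assumed separator (hand-picked), scaled to unit norm.
u = np.array([2.0, 2.0, -3.0])
u /= np.linalg.norm(u)

R = np.linalg.norm(X_aug, axis=1).max()   # radius of the augmented inputs
gamma = (y * (X_aug @ u)).min()           # margin achieved by u
bound = (R / gamma) ** 2

# Count perceptron mistakes (lr = 1, zero init) until one clean pass.
w = np.zeros(3)
total_mistakes = 0
for _ in range(1000):                     # generous cap; the loop exits early
    mistakes = 0
    for xi, yi in zip(X_aug, y):
        if yi * (w @ xi) <= 0:
            w += yi * xi
            mistakes += 1
    total_mistakes += mistakes
    if mistakes == 0:
        break

print(f"R^2 = {R**2:.1f}, gamma = {gamma:.4f}, bound = {bound:.1f}")
print("total mistakes:", total_mistakes)
```

With this choice of `u` the bound works out to 51, and the run makes 18 mistakes in total, comfortably inside it.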

3.2.3 Minimal sketch

From-scratch NumPy — no nn_timeline imports yet. Train on AND (linearly separable) then XOR (not) to make the limitation visible.

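import numpy as np

def perceptron_fit(X, y, lr=1.0, epochs=20):
    """Train a perceptron; labels y must be in {-1, +1}."""
    w = np.zeros(X.shape[1])
    b = 0.0
    mistakes = []  # number of updates made in each epoch
    for _ in range(epochs):
        m = 0
        for xi, yi in zip(X, y):
            # Update only on a mistake (or a point exactly on the boundary).
            if yi * (w @ xi + b) <= 0:
                w += lr * yi * xi
                b += lr * yi
                m += 1
        mistakes.append(m)
    return w, b, mistakes

def predict(X, w, b):
    # np.sign returns 0 for points that lie exactly on the boundary.
    return np.sign(X @ w + b)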

AND — linearly separable:

X = np.array([[0,0],[0,1],[1,0],[1,1]], dtype=float)
y_and = np.array([-1, -1, -1, +1])
w, b, hist = perceptron_fit(X, y_and)
print("weights:", w, " bias:", b)
print("mistakes per epoch:", hist)
print("predictions:", predict(X, w, b))

Output:

weights: [3. 2.]  bias: -4.0
mistakes per epoch: [2, 3, 3, 2, 2, 3, 2, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
predictions: [-1. -1. -1.  1.]

Mistakes drop to zero — the boundary cleanly separates the corners.

XOR — not linearly separable:

y_xor = np.array([-1, +1, +1, -1])
w, b, hist = perceptron_fit(X, y_xor, epochs=50)
print("last 10 epoch mistake counts:", hist[-10:])
print("predictions:", predict(X, w, b), "  true:", y_xor)

Output:

last 10 epoch mistake counts: [4, 4, 4, 4, 4, 4, 4, 4, 4, 4]
predictions: [0. 0. 0. 0.]   true: [-1  1  1 -1]

Mistakes never reach zero. No single line through the unit square separates the two diagonals — this is the wall.
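The claim can be checked directly. Suppose some \(w_1, w_2, b\) (the two components of \(\mathbf{w}\) and the bias) classified all four XOR points correctly with strict inequalities. Then:

\[ \begin{aligned} (0,0)\mapsto -1 &: \quad b < 0 \\ (0,1)\mapsto +1 &: \quad w_2 + b > 0 \\ (1,0)\mapsto +1 &: \quad w_1 + b > 0 \\ (1,1)\mapsto -1 &: \quad w_1 + w_2 + b < 0 \end{aligned} \]

Adding the two middle lines gives \(w_1 + w_2 + 2b > 0\); adding the first and last gives \(w_1 + w_2 + 2b < 0\). Both cannot hold, so no separating line exists.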

3.2.4 What to observe

  • For AND the perceptron converges after eight epochs of updates; from epoch 9 on, the mistake count is zero.
  • For XOR the mistake count never drops: with this data order the weights cycle back to zero every epoch and all four points trigger an update. The update rule is working; the hypothesis class is the problem.
  • The boundary is linear in input space. Curved boundaries require either changed features (kernels) or composed neurons with a non-linearity in between (the MLP, covered in a later chapter).
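The "changed features" route in the last point can be sketched with the same update rule. The example below appends the product \(x_1 x_2\) as a third input; this feature map is an assumption chosen for illustration (one of many that would work), and `perceptron_fit` is restated so the block runs on its own.

```python
import numpy as np

def perceptron_fit(X, y, lr=1.0, epochs=200):
    """Same mistake-driven rule as the minimal sketch, restated here."""
    w, b = np.zeros(X.shape[1]), 0.0
    history = []
    for _ in range(epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            if yi * (w @ xi + b) <= 0:
                w += lr * yi * xi
                b += lr * yi
                mistakes += 1
        history.append(mistakes)
    return w, b, history

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y_xor = np.array([-1, 1, 1, -1])

# Assumed feature map: append the product x1 * x2 as a third input.
X_feat = np.column_stack([X, X[:, 0] * X[:, 1]])

w, b, history = perceptron_fit(X_feat, y_xor)
pred = np.sign(X_feat @ w + b)
print("mistakes in final epoch:", history[-1])
print("predictions:", pred, " true:", y_xor)
```

In the three-dimensional feature space XOR is linearly separable, so the unchanged perceptron rule converges. The model is still linear in its inputs; only the inputs changed.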

3.3 Application & impact

The artificial neuron is legacy as a standalone learning machine — no one ships a single perceptron in 2026. But its DNA is in every modern model. Three threads survived; the XOR limitation forced the fourth.

[Figure: lineage diagram. Nodes: Artificial neuron (McCulloch-Pitts 1943, Perceptron 1958); MLP + backprop (1986): add hidden layer + smooth ϕ; SVM (1995): maximize the margin; Modern SGD / AdamW: any differentiable loss; Deep nets 2012+: CNN · RNN · Transformer; Statistical learning theory: VC bounds, kernel methods; LLM training: trains every modern model.]

3.3.1 What survived, what changed

| Original piece | What it became | Where you see it today |
|---|---|---|
| Threshold \(\mathrm{sign}(\mathbf{w}^\top\mathbf{x}+b)\) | Smooth activation \(\phi\) | ReLU, GELU, SiLU in every transformer block |
| Mistake-driven update | Stochastic gradient descent | AdamW training a 70B-parameter LLM |
| Convergence bound \((R/\gamma)^2\) | Margin-based learning theory | SVM (Cortes & Vapnik, 1995), statistical learning theory |
| Single-layer limit (XOR) | Motivation for depth | Every multi-layer network, made trainable in practice by backpropagation (Rumelhart et al., 1986) |

3.3.2 Concretely, where the lineage lands

  • Logistic regression = perceptron with \(\sigma\) instead of \(\mathrm{sign}\) and cross-entropy instead of the mistake rule. Still the linear baseline for tabular ML in 2026.
  • A single neuron inside a transformer is a perceptron with the threshold swapped for a smooth activation: \(\mathrm{GELU}(\mathbf{w}^\top\mathbf{x}+b)\). Widen and stack such units until the weights number around \(10^{11}\).
  • The optimizer that trains GPT-class models is a momentum-aware, adaptive-LR version of the perceptron update applied to a differentiable loss instead of a 0/1 mistake indicator.
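The first point in this list can be sketched in a few lines. The example below trains logistic regression on AND with plain full-batch gradient descent on the mean cross-entropy; the learning rate and step count are untuned assumptions, and targets are recoded to {0, 1} as is conventional for the sigmoid.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
t = np.array([0.0, 0.0, 0.0, 1.0])   # AND with targets in {0, 1}

w, b = np.zeros(2), 0.0
lr, steps = 1.0, 5000                # assumed hyperparameters, not tuned

for _ in range(steps):
    p = sigmoid(X @ w + b)           # smooth replacement for sign(...)
    err = t - p                      # graded error instead of a 0/1 mistake flag
    # Gradient step on the mean cross-entropy: the same "error times input"
    # shape as the perceptron update, but applied to every example, every step.
    w += lr * (X.T @ err) / len(X)
    b += lr * err.mean()

p = sigmoid(X @ w + b)
print("probabilities:", np.round(p, 3))
print("predictions:  ", (p >= 0.5).astype(int))
```

The structural difference from the perceptron is that the update never switches off: correctly classified points still contribute a small error term, which keeps pushing the probabilities toward 0 and 1.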
Note. Key takeaway

The artificial neuron pins down the three pieces every later architecture keeps: a parametric prediction function, a parameter-update rule, and a convergence story. The XOR wall is what forces depth and non-linearity into the picture — resolved in the chapters ahead, once the learning signal (what to optimize) and activations (the non-linearity) are established.

→ Next: Information Theory