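import numpy as np

def perceptron_fit(X, y, lr=1.0, epochs=20):
    """Train a perceptron; labels y must be in {-1, +1}."""
    w = np.zeros(X.shape[1])
    b = 0.0
    mistakes = []  # number of updates made in each epoch
    for _ in range(epochs):
        m = 0
        for xi, yi in zip(X, y):
            # Update only on a mistake (or a point exactly on the boundary).
            if yi * (w @ xi + b) <= 0:
                w += lr * yi * xi
                b += lr * yi
                m += 1
        mistakes.append(m)
    return w, b, mistakes

def predict(X, w, b):
    # np.sign returns 0 for points that lie exactly on the boundary.
    return np.sign(X @ w + b)

3 The Artificial Neuron: From McCulloch-Pitts to the Perceptron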
Depends on: Before the Perceptron — Prehistory
3.1 Why this matters
Two ideas had to fuse before anything “neural” could exist as a learning machine: what a neuron is (a thresholded weighted sum) and how it changes when shown labelled examples (a parameter update rule). McCulloch & Pitts (1943) gave the first; Rosenblatt (1958) gave the second. Every architecture in this timeline composes, smooths, or scales this same skeleton. The perceptron is not just history; it is the unit cell.
It also matters for what it cannot do. Minsky & Papert (1969) proved that a single-layer perceptron cannot represent XOR, because one linear boundary cannot separate classes that are not linearly separable. That single negative result motivates the multi-layer perceptron, and through it, every deep network.
3.2 The mechanism
3.2.1 Step 1 — McCulloch-Pitts neuron: structure, no learning
A McCulloch-Pitts (MP) unit takes binary inputs \(x_i \in \{0,1\}\), weights them (originally with \(\pm 1\)), sums, and thresholds:
\[ y = \mathbb{1}\!\left[\sum_i w_i x_i \ge \theta\right] \]
The weights and threshold are fixed by the designer. Networks of such units can implement any propositional logic function — but there is no mechanism for the network to change itself from data.
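As a minimal sketch of the threshold formula above, the following hand-wired units implement AND, OR, and NOT. The helper name `mp_unit` and the particular weights and thresholds are illustrative assumptions, not values taken from the 1943 paper; the point is only that fixed parameters are enough for logic gates.

```python
import numpy as np

def mp_unit(x, w, theta):
    """McCulloch-Pitts style unit: output 1 when the weighted sum reaches theta."""
    return int(np.dot(w, x) >= theta)

# Hand-picked weights and thresholds (illustrative choices; nothing is learned).
def mp_and(x1, x2):
    return mp_unit([x1, x2], w=[1, 1], theta=2)

def mp_or(x1, x2):
    return mp_unit([x1, x2], w=[1, 1], theta=1)

def mp_not(x1):
    return mp_unit([x1], w=[-1], theta=0)

for a in (0, 1):
    for b in (0, 1):
        print(a, b, "AND:", mp_and(a, b), "OR:", mp_or(a, b))
print("NOT 0:", mp_not(0), "NOT 1:", mp_not(1))
```

Changing a gate means editing `w` or `theta` by hand, which is exactly the gap the perceptron's learning rule fills.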
3.2.2 Step 2 — Perceptron: same structure + a learning rule
Rosenblatt’s perceptron generalises MP to real-valued inputs and weights, swaps the indicator for a sign function, and crucially introduces an update rule:
\[ \hat{y} = \mathrm{sign}(\mathbf{w}^\top \mathbf{x} + b) \quad\in\{-1,+1\} \]
Geometrically \(\mathbf{w}^\top\mathbf{x}+b=0\) is a hyperplane in \(\mathbb{R}^d\); the perceptron classifies by which side a point falls on. Given a labelled example \((\mathbf{x}_i, y_i)\) with \(y_i \in \{-1,+1\}\), the perceptron updates only when it makes a mistake:
\[ \text{if } y_i(\mathbf{w}^\top\mathbf{x}_i + b) \le 0: \quad \mathbf{w} \leftarrow \mathbf{w} + \eta\, y_i \mathbf{x}_i, \quad b \leftarrow b + \eta\, y_i \]
Intuition: a misclassified positive pushes \(\mathbf{w}\) toward it; a misclassified negative pushes it away. The boundary rotates until it separates the data — if separation is possible.
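To make the intuition concrete, here is a small sketch of one update on a single misclassified point. The function name `perceptron_step` and the example numbers are assumptions chosen for illustration. The sketch also checks an identity that follows from substituting the update into the score: after an update, the signed score \(y_i(\mathbf{w}^\top\mathbf{x}_i + b)\) grows by exactly \(\eta(\|\mathbf{x}_i\|^2 + 1)\).

```python
import numpy as np

def perceptron_step(w, b, x, y, lr=1.0):
    """Apply one perceptron update if (x, y) is misclassified; return (w, b, updated)."""
    if y * (w @ x + b) <= 0:
        return w + lr * y * x, b + lr * y, True
    return w, b, False

w, b = np.array([1.0, -1.0]), 0.0
x, y = np.array([1.0, 2.0]), +1      # a positive example

before = y * (w @ x + b)             # 1 - 2 = -1, so the point is misclassified
w_new, b_new, updated = perceptron_step(w, b, x, y)
after = y * (w_new @ x + b_new)

print("updated:", updated)
print("signed score before:", before, "after:", after)
print("increase:", after - before, "expected:", x @ x + 1)
```

In this example one step is enough to put the point on the correct side. In general a single step only moves the score in the right direction, and several updates may be needed.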
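Convergence (Novikoff, 1962). Suppose the data is linearly separable with margin \(\gamma>0\), meaning some unit-norm \(\mathbf{w}^*\) satisfies \(y_i\,\mathbf{w}^{*\top}\mathbf{x}_i \ge \gamma\) for every \(i\), and \(\|\mathbf{x}_i\|\le R\). Then the perceptron makes at most \((R/\gamma)^2\) mistakes before converging, independent of the dimension \(d\). The classical statement is for the bias-free form; the bias is covered by appending a constant 1 to each \(\mathbf{x}_i\) and treating \(b\) as one more weight.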
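The bound can be checked numerically on a tiny separable problem. The sketch below uses the AND data with a constant 1 appended to every input, so the bias becomes an ordinary weight and the bias-free form of the bound applies. The reference separator `w_star` is a hand-picked assumption; any unit-norm separator yields a valid (possibly loose) bound.

```python
import numpy as np

# AND data with a constant 1 appended, so the bias is just another weight.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([-1, -1, -1, +1])
X_aug = np.hstack([X, np.ones((len(X), 1))])

# Hand-picked separator x1 + x2 - 1.5 = 0 (an assumption), scaled to unit norm.
w_star = np.array([1.0, 1.0, -1.5])
w_star /= np.linalg.norm(w_star)

R = np.linalg.norm(X_aug, axis=1).max()   # largest input norm
gamma = (y * (X_aug @ w_star)).min()      # margin achieved by w_star
bound = (R / gamma) ** 2

# Run the perceptron from zero weights and count every update.
w = np.zeros(X_aug.shape[1])
total_mistakes = 0
for _ in range(20):
    for xi, yi in zip(X_aug, y):
        if yi * (w @ xi) <= 0:
            w += yi * xi
            total_mistakes += 1

print("bound on mistakes:", bound)
print("mistakes observed:", total_mistakes)
```

The observed count sits well below the bound, as expected: the bound is a worst case over orderings and over all data with the same \(R\) and \(\gamma\).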
3.2.3 Minimal sketch
From-scratch NumPy, with no nn_timeline imports yet, using the perceptron_fit and predict functions defined at the top of this chapter. Train on AND (linearly separable), then XOR (not), to make the limitation visible.
AND — linearly separable:
X = np.array([[0,0],[0,1],[1,0],[1,1]], dtype=float)
y_and = np.array([-1, -1, -1, +1])
w, b, hist = perceptron_fit(X, y_and)
print("weights:", w, " bias:", b)
print("mistakes per epoch:", hist)
print("predictions:", predict(X, w, b))

weights: [3. 2.] bias: -4.0
mistakes per epoch: [2, 3, 3, 2, 2, 3, 2, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
predictions: [-1. -1. -1. 1.]
Mistakes drop to zero — the boundary cleanly separates the corners.
XOR — not linearly separable:
y_xor = np.array([-1, +1, +1, -1])
w, b, hist = perceptron_fit(X, y_xor, epochs=50)
print("last 10 epoch mistake counts:", hist[-10:])
print("predictions:", predict(X, w, b), " true:", y_xor)

last 10 epoch mistake counts: [4, 4, 4, 4, 4, 4, 4, 4, 4, 4]
predictions: [0. 0. 0. 0.] true: [-1 1 1 -1]
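Mistakes never reach zero. In this run every point triggers an update and the four updates cancel, so the weights return to zero at the end of each epoch and np.sign outputs 0 for every input. No single line through the unit square separates the two diagonals; this is the wall.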
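The impossibility can be read off the four constraints directly, written out here as a short derivation. Suppose some \(w_1, w_2, b\) classified all four XOR points strictly correctly, that is \(y_i(\mathbf{w}^\top\mathbf{x}_i + b) > 0\) for each of them:

\[
\begin{aligned}
(0,0),\ y=-1 &: \quad b < 0 \\
(0,1),\ y=+1 &: \quad w_2 + b > 0 \\
(1,0),\ y=+1 &: \quad w_1 + b > 0 \\
(1,1),\ y=-1 &: \quad w_1 + w_2 + b < 0
\end{aligned}
\]

Adding the two middle inequalities gives \(w_1 + w_2 + 2b > 0\), so \(w_1 + w_2 + b > -b > 0\), which contradicts the last line. No choice of weights and bias satisfies all four constraints at once.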
3.2.4 What to observe
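- For AND the perceptron converges within nine epochs: the ninth pass is the first with zero mistakes, and the count stays at zero afterwards.
- For XOR the mistake count never reaches zero (here it stays at 4 per epoch while the weights cycle). The update rule is working; the hypothesis class is the problem.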
- The boundary is linear in input space. Curved boundaries require either changed features (kernels) or composed neurons with a non-linearity in between (the MLP, covered in a later chapter).
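The "changed features" route can be sketched in a few lines. The example below appends the product \(x_1 x_2\) as a third input, which is an assumed feature map picked for illustration (not a kernel method in the full sense), and reuses the same mistake-driven rule. The local function `fit` is a compact restatement of that rule with an early stop, written here so the block runs on its own.

```python
import numpy as np

def fit(X, y, lr=1.0, epochs=50):
    """Plain perceptron training loop (same rule as in the text), labels in {-1, +1}."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            if yi * (w @ xi + b) <= 0:
                w += lr * yi * xi
                b += lr * yi
                mistakes += 1
        if mistakes == 0:      # a clean pass: every point is strictly on the right side
            break
    return w, b

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y_xor = np.array([-1, +1, +1, -1])

# Assumed feature map for illustration: append the product x1 * x2.
X_feat = np.column_stack([X, X[:, 0] * X[:, 1]])

w, b = fit(X_feat, y_xor)
pred = np.sign(X_feat @ w + b)
print("weights:", w, "bias:", b)
print("predictions:", pred, "true:", y_xor)
```

The learner is unchanged; only the representation moved. In the three-dimensional feature space the XOR points are linearly separable, so the convergence guarantee applies again.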
3.3 Application & impact
The artificial neuron is legacy as a standalone learning machine — no one ships a single perceptron in 2026. But its DNA is in every modern model. Three threads survived; the XOR limitation forced the fourth.
3.3.1 What survived, what changed
| Original piece | What it became | Where you see it today |
|---|---|---|
| Threshold \(\mathrm{sign}(\mathbf{w}^\top\mathbf{x}+b)\) | Smooth activation \(\phi\) | ReLU, GELU, SiLU in every transformer block |
| Mistake-driven update | Stochastic gradient descent | AdamW training a 70B-parameter LLM |
| Convergence bound \((R/\gamma)^2\) | Margin-based learning theory | SVM (Cortes & Vapnik, 1995), statistical learning theory |
| Single-layer limit (XOR) | Motivation for depth | Every multi-layer network, trained with backpropagation as popularized by (Rumelhart et al., 1986) |
3.3.2 Concretely, where the lineage lands
- Logistic regression = perceptron with \(\sigma\) instead of \(\mathrm{sign}\) and cross-entropy instead of the mistake rule. Still the linear baseline for tabular ML in 2026.
- A single neuron inside a transformer is a perceptron with a smooth activation glued on: \(\mathrm{GELU}(\mathbf{w}^\top\mathbf{x}+b)\). Stack enough of them that the weights number \(\sim 10^{11}\).
- The optimizer that trains GPT-class models is a momentum-aware, adaptive-LR version of the perceptron update applied to a differentiable loss instead of a 0/1 mistake indicator.
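The link between the mistake rule and gradient descent can be made explicit with a side-by-side sketch. It assumes labels in \(\{-1,+1\}\) and writes the logistic-regression loss as \(\log(1 + e^{-y(\mathbf{w}^\top\mathbf{x}+b)})\), which is cross-entropy in that label convention. The function names are illustrative. One SGD step on this loss has the same form as the perceptron update, with the hard mistake indicator replaced by the smooth factor \(\sigma(-y(\mathbf{w}^\top\mathbf{x}+b))\).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def perceptron_update(w, b, x, y, lr=1.0):
    """Hard rule: full-size step on a mistake, no step otherwise."""
    gate = 1.0 if y * (w @ x + b) <= 0 else 0.0
    return w + lr * gate * y * x, b + lr * gate * y

def logistic_update(w, b, x, y, lr=1.0):
    """One SGD step on the logistic loss log(1 + exp(-y * (w @ x + b)))."""
    gate = sigmoid(-y * (w @ x + b))   # smooth stand-in for the mistake indicator
    return w + lr * gate * y * x, b + lr * gate * y

x, y = np.array([1.0, 1.0]), +1
for w in (np.array([-5.0, -5.0]), np.array([0.0, 0.0]), np.array([5.0, 5.0])):
    pw, _ = perceptron_update(w, 0.0, x, y)
    lw, _ = logistic_update(w, 0.0, x, y)
    print("score", w @ x, "| perceptron step", pw - w, "| logistic step", lw - w)
```

For a badly misclassified point the two steps nearly coincide; for a confidently correct point the perceptron takes no step and the logistic step shrinks toward zero. Momentum and adaptive learning rates are layered on top of this kind of gradient step.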
→ Next: Information Theory