The Context We Can't Afford to Lose: Neural Networks from First Principles
The math under today’s largest models is decades old. McCulloch and Pitts wrote down the artificial neuron — a thresholded weighted sum — in 1943. Rosenblatt added a learning rule in 1958. The algorithm that trains deep networks, backpropagation, was popularized in 1986. What changed since is not the ideas — it is the scale.
This primer covers the Foundations segment (1943–1991) from first principles — the math every later era builds on — and previews the arc from RNNs through attention to the Transformer, LLMs, and today’s agentic systems. Each later era gets its own post; this is the map.
1. Why first principles, why a timeline
As the architectures, training, and the models themselves concentrate in a handful of frontier labs, the broader community risks losing the conceptual context: why these systems work and how the field got here. Without that, it is easy to drift toward consuming models rather than understanding them.
This primer is a small push in the other direction. Its bet is simple: understanding compounds when you see how each idea descends from the last. A timeline is not nostalgia — it is the most efficient way to learn, because every architecture is a response to the limitations of the one before it. The perceptron’s failure motivates the multilayer network; the multilayer network’s training difficulty motivates everything after.
So this is not a survey. It is a build. Every concept gets three things:
- Derivable math — first principles, not “use this formula.”
- A minimal runnable sketch — a few lines you can execute and poke at.
- Its place in the lineage — what it inherited, and what it became.
Here is the foundational arc.
2. The unit cell
Strip any modern network to its smallest repeating part and you find the same object: a weighted sum followed by a non-linearity, $\phi(\mathbf{w}^\top\mathbf{x} + b)$. McCulloch and Pitts gave the structure (a thresholded sum that can compute logic).[1] Rosenblatt added the learning rule:[2] on each misclassified example $(\mathbf{x}_i, y_i)$ with $y_i \in \{-1, +1\}$, nudge the weights toward the correct label,
\[\mathbf{w} \leftarrow \mathbf{w} + \eta\, y_i \mathbf{x}_i,\]
where $\eta > 0$ is the learning rate, with a proof that, if the data is linearly separable, the rule converges after a finite number of updates.
That update rule is the ancestor of all of training. The threshold became the activation function; the mistake-driven nudge became gradient descent; the convergence proof became margin theory. The neuron is the cell; everything else is composition and scale.
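The update rule above can be sketched in a few lines. This is a minimal illustration, not Rosenblatt's original formulation: the function names `train_perceptron` and `predict`, the choice of logical AND as the dataset, and the separate bias update are all assumptions made for the example.

```python
def train_perceptron(samples, eta=1.0, max_epochs=100):
    """Mistake-driven rule: on each error, w <- w + eta * y * x."""
    w = [0.0] * len(samples[0][0])
    b = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in samples:
            score = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * score <= 0:  # misclassified, or sitting on the boundary
                w = [wi + eta * y * xi for wi, xi in zip(w, x)]
                b += eta * y    # bias treated as a weight on a constant input of 1
                mistakes += 1
        if mistakes == 0:       # a full clean pass: the rule has converged
            break
    return w, b

def predict(w, b, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1

# Logical AND with labels in {-1, +1}: a linearly separable problem.
AND = [((0, 0), -1), ((0, 1), -1), ((1, 0), -1), ((1, 1), 1)]
w, b = train_perceptron(AND)
and_preds = [predict(w, b, x) for x, _ in AND]
print(w, b, and_preds)
```

Because AND is linearly separable, the loop stops after a finite number of updates, which is the convergence guarantee in action.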
3. The first wall
The perceptron has a famous failure: it cannot learn XOR. No single straight line separates the points $(0,1)$ and $(1,0)$, where XOR is 1, from $(0,0)$ and $(1,1)$, where it is 0, and a single neuron draws exactly one line. Minsky and Papert proved this in 1969,[4] and the result chilled the field for years.
The escape is depth plus non-linearity. Add a hidden layer of neurons, each bending the input space, and a final layer can combine the pieces into the curved boundary XOR demands. Representational power was never the obstacle once you stacked neurons. The obstacle was training them — which is where the next two ideas come in.
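To make the escape concrete, here is a small sketch of a two-layer threshold network that computes XOR. The weights are hand-picked for illustration rather than learned, and the names `step` and `xor_net` are assumptions of this example.

```python
def step(z):
    """Heaviside threshold, the McCulloch-Pitts non-linearity."""
    return 1 if z > 0 else 0

def xor_net(x1, x2):
    # Hidden layer: two threshold neurons, each drawing one line.
    h_or = step(x1 + x2 - 0.5)    # fires if at least one input is on
    h_and = step(x1 + x2 - 1.5)   # fires only if both inputs are on
    # Output neuron: "OR but not AND" is linearly separable in (h_or, h_and).
    return step(h_or - h_and - 0.5)

xor_table = {(a, b): xor_net(a, b) for a in (0, 1) for b in (0, 1)}
print(xor_table)
```

The hidden layer remaps the four input points so that a single line in the new space does the job. Finding such weights automatically is the training problem the next sections address.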
4. Learning signals
Before you can train, you need to say how wrong a prediction is, not just whether it is wrong. The perceptron’s 0/1 mistake count is piecewise constant, so its gradient is zero wherever it is defined and gives no direction to move in. The fix comes from Claude Shannon’s 1948 information theory:[3] cross-entropy, the cost of a probabilistic prediction $q$ against a true label distribution $p$,
\[H(p, q) = -\sum_x p(x)\log q(x).\]
It is smooth, differentiable, and brutally unforgiving of confident mistakes, exactly the properties a gradient-based learner needs. Cross-entropy is still the training objective of essentially every classifier and language model today; next-token prediction is cross-entropy over a vocabulary.
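A quick numerical sketch shows what "unforgiving of confident mistakes" means. The three prediction vectors are made-up values for a three-class problem, and the `eps` clamp is an assumption added only to avoid `log(0)`.

```python
import math

def cross_entropy(p, q, eps=1e-12):
    """H(p, q) = -sum_x p(x) log q(x), skipping terms where p(x) = 0."""
    return -sum(pi * math.log(max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)

true_label = [0.0, 1.0, 0.0]           # one-hot: class 1 is correct
confident_right = [0.01, 0.98, 0.01]
unsure = [1 / 3, 1 / 3, 1 / 3]
confident_wrong = [0.98, 0.01, 0.01]

losses = [cross_entropy(true_label, q)
          for q in (confident_right, unsure, confident_wrong)]
print(losses)  # small, moderate, large
```

With a one-hot label the sum collapses to $-\log q(\text{correct class})$, so the loss grows without bound as the probability assigned to the right answer approaches zero.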
5. Training depth
With a loss in hand, backpropagation (Rumelhart, Hinton & Williams, 1986)[5] computes every parameter’s gradient in one organized backward sweep — the chain rule, arranged so each layer’s error is computed once and reused. The error signal at layer $l$ obeys a single recursion:
\[\boldsymbol{\delta}^{(l)} = \big((\mathbf{W}^{(l+1)})^\top \boldsymbol{\delta}^{(l+1)}\big) \odot \phi'(\mathbf{z}^{(l)}),\]
and the weight gradient is $\partial\mathcal{L}/\partial\mathbf{W}^{(l)} = \boldsymbol{\delta}^{(l)}(\mathbf{a}^{(l-1)})^\top$. This is the algorithm behind `loss.backward()` in every framework, unchanged in principle since.
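The recursion can be checked numerically. The sketch below is a toy under stated assumptions, not a framework implementation: it uses plain Python lists, `tanh` for $\phi$, a squared-error loss instead of cross-entropy, no biases, and a hypothetical 3-4-2 layer layout. It compares the backpropagated gradients against central finite differences.

```python
import math
import random

def forward(weights, x):
    """Return the activations of every layer; acts[0] is the input."""
    acts = [x]
    for W in weights:
        z = [sum(w * a for w, a in zip(row, acts[-1])) for row in W]
        acts.append([math.tanh(z_i) for z_i in z])
    return acts

def loss(weights, x, target):
    out = forward(weights, x)[-1]
    return 0.5 * sum((o - t) ** 2 for o, t in zip(out, target))

def backprop(weights, x, target):
    acts = forward(weights, x)
    # Output error: dL/da times phi'(z); for tanh, phi'(z) = 1 - tanh(z)^2.
    delta = [(o - t) * (1 - o * o) for o, t in zip(acts[-1], target)]
    grads = [None] * len(weights)
    for l in reversed(range(len(weights))):
        # Weight gradient: outer product of this layer's error and its input.
        grads[l] = [[d * a for a in acts[l]] for d in delta]
        if l > 0:
            W, below = weights[l], acts[l]
            # Recursion: delta_below = (W^T delta) * phi'(z_below).
            delta = [
                sum(W[i][j] * delta[i] for i in range(len(W))) * (1 - below[j] ** 2)
                for j in range(len(below))
            ]
    return grads

random.seed(0)
sizes = [3, 4, 2]  # hypothetical layer widths
weights = [[[random.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)]
           for n_in, n_out in zip(sizes, sizes[1:])]
x, target = [0.5, -0.2, 0.1], [0.3, -0.7]
grads = backprop(weights, x, target)

def numeric_grad(l, i, j, h=1e-6):
    """Central finite difference for a single weight."""
    original = weights[l][i][j]
    weights[l][i][j] = original + h
    up = loss(weights, x, target)
    weights[l][i][j] = original - h
    down = loss(weights, x, target)
    weights[l][i][j] = original
    return (up - down) / (2 * h)

max_error = max(abs(grads[l][i][j] - numeric_grad(l, i, j))
                for l in range(len(weights))
                for i in range(len(weights[l]))
                for j in range(len(weights[l][i])))
print(max_error)
```

The finite-difference check needs two forward passes per weight, while the backward sweep produces every gradient in one pass. That efficiency gap is the practical reason backpropagation matters.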
But backprop carries its own curse. It multiplies a factor per layer, so in a deep stack those factors compound — and gradients either vanish toward zero or explode. This single fact, identified by Hochreiter (1991)[6] and Bengio et al. (1994),[7] is the central obstacle to depth, and the reason much of what follows exists.
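A back-of-the-envelope sketch makes the compounding visible. It assumes a deliberately simplified chain of scalar sigmoid layers where every layer has the same weight and the same pre-activation, so the per-layer factor is just $w \cdot \phi'(z)$; real networks vary per layer, but the multiplicative structure is the same.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)  # peaks at 0.25 when z = 0

def chain_factor(depth, weight=1.0, z=0.0):
    """Product of per-layer factors weight * phi'(z) along a scalar chain."""
    return (weight * sigmoid_grad(z)) ** depth

# Best case for the sigmoid (z = 0, unit weights): still shrinks fast.
best_case_sigmoid = [chain_factor(d) for d in (1, 5, 10, 20)]
# Large weights flip the problem: 8 * 0.25 = 2 per layer.
exploding = chain_factor(20, weight=8.0)
print(best_case_sigmoid, exploding)
```

Twenty layers at a factor of 0.25 leave roughly $10^{-12}$ of the original signal, while a factor of 2 inflates it about a million times.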
A remarkable share of architectural innovation is, at heart, a single question: how to keep the per-layer factor near 1 so gradients survive the trip back? The recurring answers:
- LSTM / GRU gating: an additive memory path that doesn’t multiply the gradient away.
- Residual connections: an identity shortcut that always offers a factor-1 route.
- Normalization (Batch / Layer / RMS): rescale (and usually re-center) activations so units don’t saturate.
- Non-saturating activations (ReLU and kin): derivative 1 for active units, instead of the sigmoid’s derivative, which never exceeds 0.25.
Hold that question; it organizes the rest of the timeline.
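As one illustration of the "keep the factor near 1" idea, the sketch below compares a plain chain with a residual chain. It assumes scalar layers whose local derivative $f'(x)$ is a fixed small number (0.01 here, an arbitrary choice); for a residual block $y = x + f(x)$ the per-layer factor becomes $1 + f'(x)$.

```python
def plain_chain(local_grads):
    """Gradient factor through stacked layers y = f(x): product of f'(x)."""
    factor = 1.0
    for g in local_grads:
        factor *= g
    return factor

def residual_chain(local_grads):
    """Gradient factor through stacked blocks y = x + f(x): product of 1 + f'(x)."""
    factor = 1.0
    for g in local_grads:
        factor *= 1.0 + g
    return factor

local = [0.01] * 30            # 30 layers, each with a small local derivative
plain = plain_chain(local)     # collapses toward zero
resid = residual_chain(local)  # stays on the order of 1
print(plain, resid)
```

The identity shortcut does not make the learned part stronger; it only guarantees a route along which the signal is not multiplied away.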
6. From foundations to the frontier
The four ideas above are the whole game in miniature — represent, measure, differentiate, keep gradients alive. Everything after is a way of applying them to sequences and at scale.
Sequences to the Transformer
Sequences → attention. RNNs (and the LSTM, 1997)[8] process tokens one at a time, carrying a hidden state, but a fixed-size state is a bottleneck for long inputs. Attention (Bahdanau et al., 2015)[9] fixed this by letting the decoder look back at all encoder states, weighted by relevance.
Attention → Transformer. In 2017 the Transformer[10] threw out recurrence entirely and kept only attention. Its core operation, self-attention, is one equation:
\[\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V,\]
where $d_k$ is the dimension of the keys. Each token builds a query, compares it against every token’s key (a similarity score), and reads a weighted blend of their values. It is fully parallel across positions, which is exactly what made training at scale practical.
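The equation translates almost directly into code. This is a minimal single-head sketch in plain Python, assuming no masking and no learned projections: the toy input `X` is used directly as queries, keys, and values, which a real Transformer would not do.

```python
import math

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for lists of row vectors."""
    d_k = len(K[0])
    outputs, weights = [], []
    for q in Q:
        # Similarity of this query with every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        w = softmax(scores)
        weights.append(w)
        # Weighted blend of the value vectors.
        outputs.append([sum(w_t * v[c] for w_t, v in zip(w, V))
                        for c in range(len(V[0]))])
    return outputs, weights

X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three toy token vectors
out, attn = attention(X, X, X)
print(attn)
```

Each row of `attn` is a probability distribution over the tokens, so every output is a convex combination of the value vectors. The loop over queries has no dependence between iterations, which is the parallelism the text refers to.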
Scale, LLMs, and alignment
Transformer → LLMs. From there the story is mostly scale plus objective:
- Pretraining is just cross-entropy (section 4) over a vocabulary, predicting the next token across trillions of them.
- BERT[11] vs GPT[12] (2018) split on objective: masked-token (bidirectional) vs next-token (causal). The causal branch became the LLM lineage.
- Scaling laws[13] (2020) turned “make it bigger” into a quantitative recipe; GPT-3[14] showed the payoff.
- RLHF / instruction tuning[15] (2022) aligned raw next-token predictors into assistants — adding a preference signal on top of cross-entropy.
LLMs → today. The 2024–2026 frontier moved from single answers to agentic systems: tool use, multi-step planning, and reasoning models that spend compute at inference time. The unit cell hasn’t changed — it is still $\phi(\mathbf{w}^\top\mathbf{x}+b)$, trained by backprop on cross-entropy. What changed is the orchestration around it.
The formalizations behind these later stages — softmax for multi-class outputs, scaling laws for the empirical $L \propto N^{-\alpha}$ relationship, RLHF’s reward + KL objective — are deferred to their respective deep-dives, where they motivate themselves naturally.
| ① Pretrain | ② SFT | ③ Reward model | ④ RLHF / PPO |
|---|---|---|---|
| next-token prediction on large corpus | fine-tune on human demonstrations | train on preference rankings | maximize reward, stay near SFT policy |
| loss: −log P(next \| context) | same CE on (prompt, response) | learns r(prompt, response) → scalar | loss: −E[r] + β·KL(π ‖ π₀) |
Shannon's cross-entropy (1948) is the backbone at every stage — RLHF adds a KL penalty to keep the policy close to the reference
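To see how the KL penalty in stage ④ behaves, here is a toy sketch of the loss $-\mathbb{E}[r] + \beta\,\mathrm{KL}(\pi \,\|\, \pi_0)$. Everything in it is an illustrative assumption: the policy is a distribution over just three candidate responses, the rewards are made-up numbers, and no optimization (PPO or otherwise) is performed.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for discrete distributions with q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def rlhf_objective(policy, reference, rewards, beta):
    """Loss = -E_policy[r] + beta * KL(policy || reference)."""
    expected_reward = sum(pi * r for pi, r in zip(policy, rewards))
    return -expected_reward + beta * kl_divergence(policy, reference)

reference = [0.5, 0.3, 0.2]   # pi_0: the SFT policy over three responses
rewards = [0.0, 1.0, 2.0]     # hypothetical reward-model scores
greedy = [0.01, 0.01, 0.98]   # a policy that chases the top reward

for beta in (0.1, 2.0):
    print(beta,
          rlhf_objective(reference, reference, rewards, beta),
          rlhf_objective(greedy, reference, rewards, beta))
```

With a small $\beta$ the reward-chasing policy has the lower loss; with a large $\beta$ the penalty dominates and staying at the reference policy wins. That trade-off is what $\beta$ controls.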
Every box in this section is a deep-dive in its own right — and each will get its own post. This primer is the map; the territory comes next.
7. Takeaways
- The artificial neuron — $\phi(\mathbf{w}^\top\mathbf{x}+b)$ — is the unit cell of every model. Everything else is composition and scale.
- The perceptron’s XOR wall is what forced depth and non-linearity.
- Cross-entropy (Shannon, 1948) is the learning signal that makes gradient descent possible — and still trains every LLM.
- Backprop makes depth trainable; vanishing/exploding gradients make it hard — and motivate most of what came next.
- The modern arc — attention, the Transformer, LLMs, agents — is these same ideas applied to sequences and at scale. The unit cell never changed.
A handful of ideas, spanning 1943 to today, sit under every model in production. Understand them once, from first principles, and the modern frontier stops looking like magic and starts looking like consequences.
8. What’s next in this series
This post is the map. Each segment of the timeline above becomes its own deep-dive — derivation, runnable code, and history — in upcoming posts:
- The RNN era — recurrence, LSTM gating, and the sequence bottleneck.
- Attention — Bahdanau and Luong, additive vs multiplicative, the alignment story.
- The Transformer — self-attention line by line, multi-head, positional encoding.
- The LLM era — pretraining, scaling laws, and alignment (RLHF / DPO).
- The agentic frontier — tools, planning, and reasoning at inference time.
They all build on the foundations here, so this is the post to read first.
End note — this post distills the nn-timeline book, Neural Network Architectures Through Time; full formalizations and runnable reference code live in the repo. Read the draft nn-timeline book.
References
1. Foundations
- [1] McCulloch & Pitts (1943), A Logical Calculus of the Ideas Immanent in Nervous Activity, Bulletin of Mathematical Biophysics.
- [2] Rosenblatt (1958), The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain, Psychological Review.
- [3] Shannon (1948), A Mathematical Theory of Communication, Bell System Technical Journal.
- [4] Minsky & Papert (1969), Perceptrons, MIT Press.
2. Training depth
- [5] Rumelhart, Hinton & Williams (1986), Learning Representations by Back-propagating Errors, Nature 323.
- [6] Hochreiter (1991), Untersuchungen zu dynamischen neuronalen Netzen (vanishing gradients), Diploma thesis, TU Munich.
- [7] Bengio, Simard & Frasconi (1994), Learning Long-Term Dependencies with Gradient Descent is Difficult, IEEE Trans. Neural Networks.
3. Sequences, attention and scale
- [8] Hochreiter & Schmidhuber (1997), Long Short-Term Memory, Neural Computation.
- [9] Bahdanau, Cho & Bengio (2015), Neural Machine Translation by Jointly Learning to Align and Translate.
- [10] Vaswani et al. (2017), Attention Is All You Need.
- [11] Devlin et al. (2018), BERT: Pre-training of Deep Bidirectional Transformers.
- [12] Radford et al. (2018), Improving Language Understanding by Generative Pre-Training (GPT-1).
- [13] Kaplan et al. (2020), Scaling Laws for Neural Language Models.
- [14] Brown et al. (2020), Language Models are Few-Shot Learners (GPT-3).
- [15] Ouyang et al. (2022), Training Language Models to Follow Instructions with Human Feedback (InstructGPT / RLHF).