The Context We Can't Afford to Lose: Neural Networks from First Principles

TL;DR — The artificial neuron is from 1943; backpropagation, 1986. Everything since — LSTM, Transformer, and the LLMs that followed — is the same unit cell applied to sequences and scaled. This primer builds the foundational math, then previews the arc to today's agentic systems.

The math under today’s largest models is decades old. McCulloch and Pitts wrote down the artificial neuron — a thresholded weighted sum — in 1943. Rosenblatt added a learning rule in 1958. The algorithm that trains deep networks, backpropagation, was popularized in 1986. What changed since is not the ideas — it is the scale.

This primer covers the Foundations segment (1943–1991) from first principles — the math every later era builds on — and previews the arc from RNNs through attention to the Transformer, LLMs, and today’s agentic systems. Each later era gets its own post; this is the map.

Figure: timeline of the six eras.

  • 1943–1991, Foundations: MP neuron 1943 · Shannon 1948 · Perceptron 1958 · XOR limit 1969 · Backpropagation 1986 · Vanishing gradients 1991
  • 1986–2014, RNN era: recurrent networks · LSTM 1997 · GRU and Seq2Seq 2014
  • 2014–2017, Attention: Bahdanau additive attention 2015 · Luong multiplicative attention 2015
  • 2017–2020, Transformer: Transformer 2017 · BERT and GPT 2018 · GPT-2 2019
  • 2020–2024, LLM scaling: GPT-3 2020 · scaling laws · InstructGPT / RLHF 2022 · ChatGPT
  • 2024–2026, Agentic frontier: tool use · multi-step autonomy · reasoning models

1. Why first principles, why a timeline

As the architecture, training, and the models themselves centralize into a handful of frontier labs, the broader community risks losing the conceptual context: why it works and how the field got here. Without that, it is easy to drift toward consuming models rather than understanding them.

This primer is a small push in the other direction. Its bet is simple: understanding compounds when you see how each idea descends from the last. A timeline is not nostalgia — it is the most efficient way to learn, because every architecture is a response to the limitations of the one before it. The perceptron’s failure motivates the multilayer network; the multilayer network’s training difficulty motivates everything after.

So this is not a survey. It is a build. Every concept gets three things:

  • Derivable math — first principles, not “use this formula.”
  • A minimal runnable sketch — a few lines you can execute and poke at.
  • Its place in the lineage — what it inherited, and what it became.

Here is the foundational arc.

2. The unit cell

Strip any modern network to its smallest repeating part and you find the same object: a weighted sum followed by a non-linearity, $\phi(\mathbf{w}^\top\mathbf{x} + b)$. McCulloch and Pitts gave the structure (a thresholded sum that can compute logic).[1] Rosenblatt added the learning rule:[2] on each misclassified example $(\mathbf{x}_i, y_i)$ with $y_i \in \{-1, +1\}$, nudge the weights toward the correct label by a step of size $\eta$:

\[\mathbf{w} \leftarrow \mathbf{w} + \eta\, y_i \mathbf{x}_i,\]

with a proof that, if the data is linearly separable, this converges.
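The update rule above can be tried on a small linearly separable problem. The following is a minimal sketch, not Rosenblatt's original formulation: the toy dataset (logical AND), the helper names `predict` and `train_perceptron`, and the choice to fold the bias in as a constant third input are all assumptions made for illustration.

```python
# Perceptron update on a linearly separable toy problem (logical AND).
# Labels are in {-1, +1}; the bias is folded in as a constant third input of 1.

def predict(w, x):
    """Sign of the weighted sum; a sum of exactly 0 counts as -1."""
    s = sum(wi * xi for wi, xi in zip(w, x))
    return 1 if s > 0 else -1

def train_perceptron(data, eta=1.0, max_epochs=100):
    """Sweep the data until one full pass makes no mistakes."""
    w = [0.0] * len(data[0][0])
    for epoch in range(max_epochs):
        mistakes = 0
        for x, y in data:
            if predict(w, x) != y:
                # w <- w + eta * y * x, applied only on a misclassified example
                w = [wi + eta * y * xi for wi, xi in zip(w, x)]
                mistakes += 1
        if mistakes == 0:
            return w, epoch
    return w, max_epochs

# Hypothetical toy data: (x1, x2, constant 1) -> label
AND_DATA = [
    ((0, 0, 1), -1),
    ((0, 1, 1), -1),
    ((1, 0, 1), -1),
    ((1, 1, 1), +1),
]

weights, epochs = train_perceptron(AND_DATA)
print("weights:", weights, "clean pass reached at epoch", epochs)
```

Because AND is linearly separable, the loop stops after a handful of passes; swapping in XOR labels would leave it running until `max_epochs`.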

That update rule is the ancestor of all of training. The threshold became the activation function; the mistake-driven nudge became gradient descent; the convergence proof became margin theory. The neuron is the cell; everything else is composition and scale.

Figure: from unit cell to scale (compose, then scale).

  • Artificial neuron, φ(w·x + b): the atomic unit, a weighted sum passed through a non-linearity. McCulloch–Pitts (1943) gave the structure; Rosenblatt (1958) gave the learning rule.
  • Composed (MLP), + hidden layers, backprop: stack neurons into hidden layers and train them with backpropagation (1986). This is what clears the XOR wall.
  • Scaled, CNN · RNN · Transformer: scale the same loop (forward, loss, backward, update) to millions or billions of neurons.

3. The first wall

The perceptron has a famous failure: it cannot learn XOR. XOR labels the corners $(0,1)$ and $(1,0)$ of the unit square as 1 and the corners $(0,0)$ and $(1,1)$ as 0. No single straight line separates one diagonal pair from the other, and a single neuron draws exactly one line. Minsky and Papert proved this in 1969,[4] and the result chilled the field for years.

The escape is depth plus non-linearity. Add a hidden layer of neurons, each bending the input space, and a final layer can combine the pieces into the curved boundary XOR demands. Representational power was never the obstacle once you stacked neurons. The obstacle was training them — which is where the next two ideas come in.
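To make the escape concrete, here is a small sketch of a two-layer network that computes XOR with hand-set weights. The weights, the hard-threshold `step` activation, and the function names are illustrative assumptions, not a trained model. The coarse grid search at the end is only a sanity check that no single threshold unit on that grid reproduces XOR; it is not a proof.

```python
def step(z):
    """Hard threshold: 1 if the weighted sum is positive, else 0."""
    return 1 if z > 0 else 0

def xor_net(x1, x2):
    """Two hidden threshold units feeding one output unit."""
    h_or = step(x1 + x2 - 0.5)       # fires if at least one input is 1
    h_and = step(x1 + x2 - 1.5)      # fires only if both inputs are 1
    return step(h_or - h_and - 0.5)  # "OR but not AND" is XOR

def single_unit_solves_xor(w1, w2, b):
    """Does one threshold unit with these weights match XOR on all four corners?"""
    return all(
        step(w1 * x1 + w2 * x2 + b) == (x1 ^ x2)
        for x1 in (0, 1)
        for x2 in (0, 1)
    )

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, "->", xor_net(x1, x2))

# Coarse grid over weights and bias in [-2, 2]; illustrative only.
grid = [i * 0.5 for i in range(-4, 5)]
found = any(
    single_unit_solves_xor(w1, w2, b) for w1 in grid for w2 in grid for b in grid
)
print("single unit solves XOR on this grid:", found)
```

The hidden units each draw one line; the output unit combines the two half-planes into the band between them, which is the region XOR needs.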

4. Learning signals

Before you can train, you need to say how wrong a prediction is, not just whether it’s wrong. The perceptron’s 0/1 mistake count is piecewise constant in the weights, so its gradient is zero almost everywhere and says nothing about which way to move. The fix comes from Claude Shannon’s 1948 information theory:[3] cross-entropy, the cost of a probabilistic prediction $q$ against a true label distribution $p$,

\[H(p, q) = -\sum_x p(x)\log q(x).\]

It is smooth, differentiable, and brutally unforgiving of confident mistakes — exactly the properties a gradient-based learner needs. Cross-entropy is still the training objective of essentially every classifier and language model today; next-token prediction is cross-entropy over a vocabulary.

Figure: loss $-\log q$ against predicted probability $q = P(y=1)$ for a positive example ($y=1$). The curve is smooth and penalizes confident mistakes without limit: low at $q = 0.9$ (correct), moderate at $q = 0.5$ (uncertain), steep near $q = 0.1$ (confident and wrong).
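The cross-entropy formula above is short enough to evaluate by hand. The sketch below assumes a two-class problem with a one-hot true label and uses natural logarithms; the function name `cross_entropy` and the sample probabilities are chosen for illustration.

```python
import math

def cross_entropy(p, q):
    """H(p, q) = -sum_x p(x) log q(x); terms with p(x) = 0 contribute nothing."""
    return -sum(px * math.log(qx) for px, qx in zip(p, q) if px > 0)

# Positive example: the true distribution puts all mass on class y=1.
true_label = [0.0, 1.0]

for q1 in (0.9, 0.5, 0.1, 0.001):
    prediction = [1.0 - q1, q1]
    loss = cross_entropy(true_label, prediction)
    print(f"q = {q1:<6} loss = {loss:.3f}")
# q = 0.9    loss = 0.105
# q = 0.5    loss = 0.693
# q = 0.1    loss = 2.303
# q = 0.001  loss = 6.908
```

With a one-hot label the sum collapses to $-\log q$ of the correct class, which is why the loss grows without bound as the predicted probability of the right answer approaches zero.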

5. Training depth

With a loss in hand, backpropagation (Rumelhart, Hinton & Williams, 1986)[5] computes every parameter’s gradient in one organized backward sweep — the chain rule, arranged so each layer’s error is computed once and reused. The error signal at layer $l$ obeys a single recursion:

\[\boldsymbol{\delta}^{(l)} = \big((\mathbf{W}^{(l+1)})^\top \boldsymbol{\delta}^{(l+1)}\big) \odot \phi'(\mathbf{z}^{(l)}),\]

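and the weight gradient is $\partial\mathcal{L}/\partial\mathbf{W}^{(l)} = \boldsymbol{\delta}^{(l)}(\mathbf{a}^{(l-1)})^\top$, where $\mathbf{z}^{(l)} = \mathbf{W}^{(l)}\mathbf{a}^{(l-1)} + \mathbf{b}^{(l)}$ is the pre-activation of layer $l$ and $\mathbf{a}^{(l)} = \phi(\mathbf{z}^{(l)})$ is its output. This is the algorithm behind automatic-differentiation calls such as PyTorch’s loss.backward(), unchanged in principle since.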
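The recursion can be checked numerically on a very small network. The sketch below is one possible instance, under assumptions not in the text: a single tanh hidden layer, one sigmoid output unit, binary cross-entropy loss (for which the output error simplifies to $a - y$), and arbitrary hand-picked weights. It compares one backprop gradient entry against a central finite difference.

```python
import math

def forward(W1, b1, W2, b2, x):
    """One hidden tanh layer followed by one sigmoid output unit."""
    z1 = [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(W1, b1)]
    a1 = [math.tanh(z) for z in z1]
    z2 = sum(w * a for w, a in zip(W2, a1)) + b2
    a2 = 1.0 / (1.0 + math.exp(-z2))
    return a1, a2

def loss_fn(W1, b1, W2, b2, x, y):
    """Binary cross-entropy for a label y in {0, 1}."""
    _, a2 = forward(W1, b1, W2, b2, x)
    return -(y * math.log(a2) + (1 - y) * math.log(1 - a2))

def backward(W1, b1, W2, b2, x, y):
    """Backward sweep: output error first, then the hidden-layer recursion."""
    a1, a2 = forward(W1, b1, W2, b2, x)
    delta2 = a2 - y                                # output error (sigmoid + cross-entropy)
    dW2 = [delta2 * a for a in a1]                 # delta2 * a1^T
    # delta1 = (W2^T delta2) * tanh'(z1), with tanh'(z) = 1 - tanh(z)^2
    delta1 = [w * delta2 * (1.0 - a * a) for w, a in zip(W2, a1)]
    dW1 = [[d * xi for xi in x] for d in delta1]   # delta1 * x^T
    return dW1, dW2

# Arbitrary illustrative parameters and a single training example.
W1 = [[0.5, -0.3], [0.1, 0.8], [-0.6, 0.2]]
b1 = [0.0, 0.1, -0.1]
W2 = [0.7, -0.4, 0.3]
b2 = 0.05
x, y = [1.0, -2.0], 1.0

dW1, dW2 = backward(W1, b1, W2, b2, x, y)

# Central finite difference on one weight, W1[0][1], as an independent check.
eps = 1e-6
W1_plus = [row[:] for row in W1]
W1_minus = [row[:] for row in W1]
W1_plus[0][1] += eps
W1_minus[0][1] -= eps
numeric = (loss_fn(W1_plus, b1, W2, b2, x, y)
           - loss_fn(W1_minus, b1, W2, b2, x, y)) / (2 * eps)

print("backprop:", dW1[0][1], "finite difference:", numeric)
```

The two numbers should agree to several decimal places. The finite-difference route needs two forward passes per weight, while the backward sweep yields every gradient from one pass, which is the practical point of the algorithm.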

But backprop carries its own curse. It multiplies a factor per layer, so in a deep stack those factors compound — and gradients either vanish toward zero or explode. This single fact, identified by Hochreiter (1991)[6] and Bengio et al. (1994),[7] is the central obstacle to depth, and the reason much of what follows exists.

Figure: gradient norm ‖gradient‖ against depth (layers). Backprop multiplies one factor per layer. If the typical factor is below 1 the gradient decays exponentially with depth (vanish); above 1 it grows without bound (explode). The hatched wedge between the two curves is the widening spread.
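The compounding is easy to see with numbers. The sketch below uses two deliberately simplified scalar models, both assumptions for illustration: an idealized chain where every layer contributes the same factor, and a chain of sigmoid units $a_l = \sigma(w\,a_{l-1})$ whose per-layer factor is $w\,\sigma'(z_l)$.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def constant_factor_gradient(factor, depth):
    """Idealized case: every layer scales the signal by the same factor."""
    return factor ** depth

def sigmoid_chain_gradient(w, depth, x=0.5):
    """d a_depth / d x for the scalar chain a_l = sigmoid(w * a_{l-1})."""
    a, grad = x, 1.0
    for _ in range(depth):
        a = sigmoid(w * a)
        grad *= w * a * (1.0 - a)  # per-layer factor: w * sigmoid'(z)
    return grad

for depth in (1, 5, 10, 20):
    print(
        f"depth {depth:>2}:",
        f"factor 0.25 -> {constant_factor_gradient(0.25, depth):.3e},",
        f"factor 1.5 -> {constant_factor_gradient(1.5, depth):.3e},",
        f"sigmoid chain (w=1) -> {sigmoid_chain_gradient(1.0, depth):.3e}",
    )
```

With unit weights, each sigmoid layer contributes a factor of at most 0.25, so the sigmoid chain shrinks at least as fast as the idealized 0.25 column.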

A remarkable share of architectural innovation is, at heart, a single question: how to keep the per-layer factor near 1 so gradients survive the trip back? The recurring answers:

  • LSTM / GRU gating: an additive memory path that doesn’t multiply the gradient away.
  • Residual connections: an identity shortcut that always offers a factor-1 route.
  • Normalization (Batch / Layer / RMS): re-scale activations (and, for Batch and Layer, re-center them) so units don’t saturate.
  • Non-saturating activations (ReLU and kin): derivative 1 for active units, unlike the sigmoid, whose derivative never exceeds 0.25.

Hold that question; it organizes the rest of the timeline.
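Two of those answers can be read straight off the chain rule. As a sketch, assume a residual block of the form $\mathbf{y} = \mathbf{x} + f(\mathbf{x})$ (the symbol $f$ for the block’s learned transformation is introduced here for illustration). Its Jacobian and the resulting backward signal are

\[\frac{\partial \mathbf{y}}{\partial \mathbf{x}} = \mathbf{I} + \frac{\partial f}{\partial \mathbf{x}}, \qquad \boldsymbol{\delta}_{\mathbf{x}} = \boldsymbol{\delta}_{\mathbf{y}} + \Big(\frac{\partial f}{\partial \mathbf{x}}\Big)^{\!\top} \boldsymbol{\delta}_{\mathbf{y}},\]

so the incoming error $\boldsymbol{\delta}_{\mathbf{y}}$ reaches the layer below unchanged through the identity term, whatever $f$ does. For activations, compare the sigmoid with ReLU:

\[\sigma'(z) = \sigma(z)\big(1 - \sigma(z)\big) \le \tfrac{1}{4}, \qquad \mathrm{ReLU}'(z) = 1 \ \text{for } z > 0.\]

A stack of sigmoid layers therefore multiplies in a factor of at most $1/4$ per layer before the weights are counted, while active ReLU units pass the signal through at factor 1.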

6. From foundations to the frontier

The four ideas above are the whole game in miniature — represent, measure, differentiate, keep gradients alive. Everything after is a way of applying them to sequences and at scale.

Sequences to the Transformer

Sequences → attention. RNNs (and the LSTM, 1997)[8] process tokens one at a time, carrying a hidden state, but a fixed-size state is a bottleneck for long inputs. Attention (Bahdanau, 2015)[9] fixed this by letting the decoder look back at every encoder state, weighted by relevance, instead of relying on a single summary vector.

Attention → Transformer. In 2017 the Transformer[10] threw out recurrence entirely and kept only attention. Its core operation, self-attention, is one equation:

\[\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V.\]

Each token builds a query, compares it against every token’s key, its own included (a similarity score), and reads a weighted blend of the values. All positions are computed at once as matrix products, with no step-by-step recurrence, which is exactly what made training at scale practical.

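Figure: self-attention for token 2 of a three-token sequence. Token 2 supplies the query Q₂; every token supplies a key K and a value V. The weights are αᵢ = softmax( Q₂ · Kᵢᵀ / √d ), shown as 0.15, 0.70, 0.15 for tokens 1 to 3, and the output is output₂ = Σ αᵢ · Vᵢ.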
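The attention equation can be written out in a few lines. The sketch below is a single-head, unbatched version in plain Python lists, with no learned projections and no masking; the function names and the toy three-token inputs are assumptions for illustration.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V, one query row at a time."""
    d_k = len(K[0])
    outputs, weights = [], []
    for q in Q:
        # Similarity of this query with every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        alpha = softmax(scores)
        # Weighted blend of the value vectors.
        blended = [sum(a * v[j] for a, v in zip(alpha, V)) for j in range(len(V[0]))]
        outputs.append(blended)
        weights.append(alpha)
    return outputs, weights

# Hypothetical toy sequence: 3 tokens, 2-dimensional queries, keys and values.
Q = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

outputs, weights = attention(Q, K, V)
for i, (alpha, out) in enumerate(zip(weights, outputs), start=1):
    print(f"token {i}: weights {[round(a, 3) for a in alpha]} -> output {[round(o, 3) for o in out]}")
```

The loop over queries is written out for readability. Each query row is independent of the others, which is why a real implementation can replace the loop with a single matrix product.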

Scale, LLMs, and alignment

Transformer → LLMs. From there the story is mostly scale plus objective:

  • Pretraining is just cross-entropy (section 4) over a vocabulary, predicting the next token across trillions of them.
  • BERT[11] vs GPT[12] (2018) split on objective: masked-token (bidirectional) vs next-token (causal). The causal branch became the LLM lineage.
  • Scaling laws[13] (2020) turned “make it bigger” into a quantitative recipe; GPT-3[14] showed the payoff.
  • RLHF / instruction tuning[15] (2022) aligned raw next-token predictors into assistants — adding a preference signal on top of cross-entropy.
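The first bullet can be made concrete for a single position. The sketch below computes the next-token loss $-\log P(\text{next} \mid \text{context})$ from raw scores, assuming the model turns scores (logits) into probabilities with a softmax; the four-word vocabulary and the logit values are made up for illustration.

```python
import math

def next_token_loss(logits, target_index):
    """-log softmax(logits)[target_index], computed via a stable log-sum-exp."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target_index]

# Hypothetical toy vocabulary and model scores for the next position.
vocab = ["the", "cat", "sat", "mat"]
logits = [0.2, 2.5, 0.1, -1.0]

for word in ("cat", "mat"):
    loss = next_token_loss(logits, vocab.index(word))
    print(f"true next token {word!r}: loss = {loss:.3f}")
```

Pretraining averages this quantity over every position in the corpus; the loss is small when the model already ranks the true next token highly and large when it does not.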

LLMs → today. The 2024–2026 frontier moved from single answers to agentic systems: tool use, multi-step planning, and reasoning models that spend compute at inference time. The unit cell hasn’t changed — it is still $\phi(\mathbf{w}^\top\mathbf{x}+b)$, trained by backprop on cross-entropy. What changed is the orchestration around it.

The formalizations behind these later stages — softmax for multi-class outputs, scaling laws for the empirical $L \propto N^{-\alpha}$ relationship, RLHF’s reward + KL objective — are deferred to their respective deep-dives, where they motivate themselves naturally.

Figure: the four-stage LLM training pipeline.

  1. Pretrain: next-token prediction on a large corpus. Loss: −log P(next | context).
  2. SFT: fine-tune on human demonstrations. Same cross-entropy, on (prompt, response) pairs.
  3. Reward model: train on preference rankings. Learns r(prompt, response) → scalar.
  4. RLHF / PPO: maximize reward while staying near the SFT policy. Loss: −E[r] + β·KL(π ‖ π₀).

Shannon's cross-entropy (1948) is the backbone at every stage — RLHF adds a KL penalty to keep the policy close to the reference
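The stage-four objective can be evaluated on a toy case. The sketch below assumes a single prompt with three candidate responses, so that the policy $\pi$ and the reference policy $\pi_0$ are small discrete distributions; the probabilities, rewards, and the value of `beta` are invented for illustration, and no PPO optimization is performed.

```python
import math

def kl_divergence(pi, pi_ref):
    """KL(pi || pi_ref) for two discrete distributions over the same responses."""
    return sum(p * math.log(p / r) for p, r in zip(pi, pi_ref) if p > 0)

def rlhf_loss(pi, pi_ref, rewards, beta=0.1):
    """-E_pi[r] + beta * KL(pi || pi_ref), the objective shown in stage four."""
    expected_reward = sum(p * r for p, r in zip(pi, rewards))
    return -expected_reward + beta * kl_divergence(pi, pi_ref)

# Hypothetical numbers: three candidate responses to one prompt.
pi_ref = [0.5, 0.3, 0.2]    # reference (SFT) policy
rewards = [0.1, 1.0, 0.4]   # reward-model scores

unchanged = pi_ref
shifted = [0.2, 0.7, 0.1]   # moves mass toward the high-reward response

print("loss, policy unchanged:", round(rlhf_loss(unchanged, pi_ref, rewards), 4))
print("loss, policy shifted:  ", round(rlhf_loss(shifted, pi_ref, rewards), 4))
print("KL paid for the shift: ", round(kl_divergence(shifted, pi_ref), 4))
```

Shifting probability toward the high-reward response lowers the loss, but the KL term charges for the distance moved; a larger `beta` keeps the policy closer to the reference.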

Every box in this section is a deep-dive in its own right — and each will get its own post. This primer is the map; the territory comes next.

7. Takeaways

  1. The artificial neuron — $\phi(\mathbf{w}^\top\mathbf{x}+b)$ — is the unit cell of every model. Everything else is composition and scale.
  2. The perceptron’s XOR wall is what forced depth and non-linearity.
  3. Cross-entropy (Shannon, 1948) is the learning signal that makes gradient descent possible — and still trains every LLM.
  4. Backprop makes depth trainable; vanishing/exploding gradients make it hard — and motivate most of what came next.
  5. The modern arc — attention, the Transformer, LLMs, agents — is these same ideas applied to sequences and at scale. The unit cell never changed.

A handful of ideas, spanning 1943 to today, sit under every model in production. Understand them once, from first principles, and the modern frontier stops looking like magic and starts looking like consequences.

8. What’s next in this series

This post is the map. Each segment of the timeline above becomes its own deep-dive — derivation, runnable code, and history — in upcoming posts:

  • The RNN era — recurrence, LSTM gating, and the sequence bottleneck.
  • Attention — Bahdanau and Luong, additive vs multiplicative, the alignment story.
  • The Transformer — self-attention line by line, multi-head, positional encoding.
  • The LLM era — pretraining, scaling laws, and alignment (RLHF / DPO).
  • The agentic frontier — tools, planning, and reasoning at inference time.

They all build on the foundations here, so this is the post to read first.

End note — this post distills the nn-timeline book, Neural Network Architectures Through Time; full formalizations and runnable reference code live in the repo. Read the draft nn-timeline book.


References

1. Foundations

  1. McCulloch & Pitts (1943), A Logical Calculus of the Ideas Immanent in Nervous Activity, Bulletin of Mathematical Biophysics.
  2. Rosenblatt (1958), The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain, Psychological Review.
  3. Shannon (1948), A Mathematical Theory of Communication, Bell System Technical Journal.
  4. Minsky & Papert (1969), Perceptrons, MIT Press.

2. Training depth

  5. Rumelhart, Hinton & Williams (1986), Learning Representations by Back-propagating Errors, Nature 323.
  6. Hochreiter (1991), Untersuchungen zu dynamischen neuronalen Netzen (vanishing gradients), Diploma thesis, TU Munich.
  7. Bengio, Simard & Frasconi (1994), Learning Long-Term Dependencies with Gradient Descent is Difficult, IEEE Trans. Neural Networks.

3. Sequences, attention and scale

  8. Hochreiter & Schmidhuber (1997), Long Short-Term Memory, Neural Computation.
  9. Bahdanau, Cho & Bengio (2015), Neural Machine Translation by Jointly Learning to Align and Translate.
  10. Vaswani et al. (2017), Attention Is All You Need.
  11. Devlin et al. (2018), BERT: Pre-training of Deep Bidirectional Transformers.
  12. Radford et al. (2018), Improving Language Understanding by Generative Pre-Training (GPT-1).
  13. Kaplan et al. (2020), Scaling Laws for Neural Language Models.
  14. Brown et al. (2020), Language Models are Few-Shot Learners (GPT-3).
  15. Ouyang et al. (2022), Training Language Models to Follow Instructions with Human Feedback (InstructGPT / RLHF).
