6 Alignment

TL;DR

Alignment addresses a failure that needs no attacker: a competent model optimizing the wrong objective.
It splits in two. Outer: is the specified objective right? Inner: did the model actually internalize it?
The practical stack runs IRL → RLHF → Constitutional AI → DPO, each making supervision cheaper to scale, through fewer human labels (Constitutional AI) or less training machinery (DPO).
None of it solves alignment. It makes models more agreeable, which is not the same thing.
Under agency, alignment assumptions built for single-turn models start to break.

Alignment is the first of the five pillars in the landscape: making a system want what its designers intend. Where Robustness & Security defends the system against external attackers, alignment addresses a failure that needs no attacker: a competent model optimizing the wrong objective.

6.1 The Alignment Problem

Two failures, present before any deep-learning system existed:

Outer alignment: does the specified objective capture what we actually want? A reward function is a proxy; optimizing a proxy hard enough surfaces the gap between proxy and intent. This is the lineage of the warning in Wiener (1960) that a machine pursuing a literally specified goal may pursue it past the point where we would have stopped it.
Inner alignment: does the trained model internalize the specified objective, or merely a correlate that held on the training distribution? A model can score perfectly in training while having learned a goal that diverges off-distribution.

Key idea

Outer = the objective is wrong. Inner = the objective is right but the model learned something else. Every alignment technique in this chapter attacks outer alignment. Inner alignment is why interpretability and monitoring exist: you cannot check it from behavior alone.

Two distinct gaps. Outer alignment concerns whether the objective matches intent; inner alignment whether the trained model internalizes the objective.

Specification gaming is the outer failure made concrete: agents satisfy the literal objective while violating its purpose. The canonical illustration is the boat-racing agent that circles to collect respawning reward targets instead of finishing the course. Amodei et al. (2016) placed reward hacking among five concrete problems (negative side effects, reward hacking, scalable oversight, safe exploration, and robustness to distributional shift), the agenda the field has worked through since.
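A toy version of the boat-racing failure makes the gap between proxy and intent concrete. The sketch below is an illustration only: the one-dimensional track, the respawning bonus tile, the reward values, and the names `run`, `racer`, and `looper` are all assumptions introduced here, not details of the original environment.

```python
def run(policy, horizon=20, bonus_pos=2, finish_pos=5, finish_reward=3):
    """Simulate a hypothetical 1-D track; return (proxy_return, finished)."""
    pos, proxy_return = 0, 0
    for _ in range(horizon):
        pos = max(0, pos + policy(pos))   # policy returns -1 or +1
        if pos == bonus_pos:
            proxy_return += 1             # bonus tile respawns on every visit
        if pos == finish_pos:
            return proxy_return + finish_reward, True
    return proxy_return, False


def racer(pos):
    """Intended behavior: drive straight to the finish line."""
    return +1


def looper(pos):
    """Specification gaming: shuttle back and forth across the bonus tile."""
    return +1 if pos < 2 else -1


racer_return, racer_finished = run(racer)
looper_return, looper_finished = run(looper)
print(racer_return, racer_finished)
print(looper_return, looper_finished)
```

Under these made-up numbers the looping policy earns the higher proxy return without ever finishing, so an optimizer that sees only the proxy prefers it.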

Instrumental convergence (Omohundro, 2008) explains why misalignment is dangerous rather than merely wrong: a wide range of terminal goals share instrumental sub-goals — self-preservation, resource acquisition, goal-preservation. A capable system pursuing almost any objective has reason to resist shutdown and accumulate capability, which is what makes a small specification gap a safety problem rather than a quality bug (Bostrom, 2014).

Inverse reinforcement learning (Ng & Russell, 2000) framed an early response: rather than hand-specify reward, infer it from demonstrated behavior — the conceptual seed of preference-based alignment. Given an expert policy \(\pi^*\) and a reward class \(\mathcal{R}\), IRL recovers:

\[ R^* = \arg\max_{R \in \mathcal{R}}\; \mathbb{E}_{\pi^*}\!\left[\sum_t R(s_t)\right] - \max_{\pi}\;\mathbb{E}_\pi\!\left[\sum_t R(s_t)\right] \]

The first term rewards the expert’s behavior; the second penalizes rewards under which some other policy would score higher. Because the inner maximum ranges over all policies, the expert included, the gap is at most zero, and it reaches zero exactly when the expert is optimal under \(R\). That is what we want, but it is not uniquely achievable: any positive scalar multiple of \(R^*\) attains the same value, and so does the constant reward \(R = 0\) (reward ambiguity, which preference learning inherits).
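The ambiguity can be checked numerically. The sketch below assumes a one-step setting in which a policy is simply a distribution over three outcomes, so the sum over \(t\) collapses to a single reward; the function name `irl_gap` and all reward values are hypothetical.

```python
import numpy as np


def irl_gap(reward, expert_policy):
    """Expert's expected reward minus the best achievable expected reward.

    One-step setting (an assumption of this sketch): a policy is a
    distribution over outcomes, so the best policy puts all of its mass
    on the highest-reward outcome.
    """
    expert_value = float(expert_policy @ reward)
    best_value = float(np.max(reward))
    return expert_value - best_value


expert = np.array([0.0, 1.0, 0.0])           # expert always picks outcome 1
r_consistent = np.array([0.0, 1.0, 0.2])     # expert is optimal under this reward
r_inconsistent = np.array([1.0, 0.0, 0.0])   # another policy beats the expert

gaps = {
    "consistent": irl_gap(r_consistent, expert),
    "scaled x3": irl_gap(3.0 * r_consistent, expert),
    "all zeros": irl_gap(np.zeros(3), expert),
    "inconsistent": irl_gap(r_inconsistent, expert),
}
print(gaps)
```

The first three rewards all reach the maximum gap of zero, including the uninformative all-zero reward; only the reward that ranks another outcome above the expert's choice is penalized. The objective alone cannot choose among the first three.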

6.2 Learning from Feedback

If reward cannot be specified, it can be learned from comparisons. RLHF — RL from human feedback (Christiano et al., 2017) — collects human preference judgments over pairs of model outputs, fits a reward model to those preferences, and optimizes the policy against the learned reward. Ouyang et al. (2022) (InstructGPT) scaled this to language models and made it the production standard: the alignment step that turned a capable next-token predictor into a model that follows instructions.

RLHF — a learned reward model trained on human preference comparisons supplies the signal that policy optimization then maximizes.
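The reward-model fitting step can be sketched with the Bradley–Terry model, under which the probability that output \(y_w\) is preferred to \(y_l\) is \(\sigma(r(y_w) - r(y_l))\). The example below is a minimal illustration, not a production recipe: it assumes a lookup table with one scalar reward per output instead of a neural network over \((x, y)\), and the comparison data and the name `fit_bradley_terry` are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_bradley_terry(comparisons, items, lr=0.1, steps=200):
    """Fit one scalar reward per item by gradient descent on the
    Bradley-Terry negative log-likelihood, -log sigmoid(r_w - r_l)."""
    reward = {item: 0.0 for item in items}
    for _ in range(steps):
        grad = {item: 0.0 for item in items}
        for winner, loser in comparisons:
            p_win = sigmoid(reward[winner] - reward[loser])
            grad[winner] -= 1.0 - p_win   # d(-log p_win) / d r_winner
            grad[loser] += 1.0 - p_win    # d(-log p_win) / d r_loser
        for item in items:
            reward[item] -= lr * grad[item]
    return reward


# Hypothetical (winner, loser) labels over three candidate outputs.
comparisons = [("a", "b"), ("a", "b"), ("b", "a"), ("b", "c"), ("a", "c")]
reward = fit_bradley_terry(comparisons, items=["a", "b", "c"])
print(sorted(reward, key=reward.get, reverse=True))
```

Only differences between rewards enter the likelihood, so the fitted values are identified up to an additive constant. Note also that the inconsistent labels (one vote for "b" over "a") are absorbed as noise, not flagged.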

The learning objective is KL-regularized RL: maximize expected reward while staying close to the reference (SFT) policy:

\[ \max_\pi\; \mathbb{E}_{x \sim \mathcal{D}}\!\Bigl[\, \mathbb{E}_{y \sim \pi(\cdot\mid x)}\!\bigl[r_\phi(x,y)\bigr] \;-\; \beta\,\mathrm{KL}\!\bigl[\pi(\cdot\mid x)\;\|\;\pi_{\text{ref}}(\cdot\mid x)\bigr] \Bigr] \]

The \(\beta\) coefficient controls how far policy optimization can move from the SFT baseline. Too small: reward over-optimization (Goodharting on \(r_\phi\)). Too large: alignment training barely moves the policy. In practice \(r_\phi\) is the Bradley–Terry reward model fit to human preference comparisons (derivation in the supplement).
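The role of \(\beta\) is easiest to see for a single prompt with a handful of candidate responses, where the objective has the standard closed-form maximizer \(\pi^*(y \mid x) \propto \pi_{\text{ref}}(y \mid x)\exp\bigl(r_\phi(x,y)/\beta\bigr)\). The sketch below assumes three discrete responses with made-up reference probabilities and rewards; the function names are hypothetical.

```python
import numpy as np


def objective(pi, pi_ref, r, beta):
    """Expected reward minus beta * KL(pi || pi_ref) for one prompt."""
    kl = np.sum(pi * np.log(pi / pi_ref))
    return float(pi @ r - beta * kl)


def optimal_policy(pi_ref, r, beta):
    """Closed-form maximizer: pi*(y) proportional to pi_ref(y) * exp(r(y) / beta)."""
    logits = np.log(pi_ref) + r / beta
    logits -= logits.max()            # shift for numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()


pi_ref = np.array([0.5, 0.3, 0.2])    # hypothetical SFT policy over 3 responses
r = np.array([0.0, 1.0, 2.0])         # hypothetical reward-model scores

for beta in (0.05, 1.0, 100.0):
    print(beta, np.round(optimal_policy(pi_ref, r, beta), 3))
```

With a small \(\beta\) nearly all probability moves to the highest-scoring response, which is exactly where errors in \(r_\phi\) get exploited; with a large \(\beta\) the policy stays close to \(\pi_{\text{ref}}\).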

RLHF moved the specification problem rather than dissolving it. The reward model is itself a proxy — a model of human preference — and optimizing hard against it produces reward hacking at one remove: outputs that the reward model scores highly but humans, on reflection, do not endorse (sycophancy, confident-sounding fabrication, verbosity that games the rater). The proxy gap of Amodei et al. (2016) reappears one level up.

Pitfall

RLHF is not alignment. It is easy to read the technique as solving the problem, but it only moves it: the reward model is itself a proxy, so the specification gap reappears one level up. A model that has been RLHF’d is reliably more agreeable, which is not the same thing as being aligned. The two come apart where it matters most.

6.3 Scaling Supervision

Human preference labels are expensive and inconsistent. Two responses reduce the cost, the first by cutting human labeling and the second by cutting training machinery:

Constitutional AI (Bai et al., 2022) replaces most human harm-labels with a written set of principles. The model critiques and revises its own outputs against the constitution (supervised phase), then a preference model trained on AI judgments drives RL — RLAIF. Alignment targets become legible natural-language rules rather than an implicit signal buried in thousands of labels.
Direct Preference Optimization (Rafailov et al., 2023) targets the same preferences as RLHF but derives the update differently. The KL-constrained objective has a closed-form optimum; substituting it into the Bradley–Terry likelihood lets the partition function cancel, yielding a direct classification loss with no separate reward-model training step.

The DPO loss (full derivation in the supplement):

\[ \mathcal{L}_{\text{DPO}} = -\mathbb{E}_{(x,\,y_w,\,y_l)}\!\left[ \log\sigma\!\left( \beta\log\frac{\pi_\theta(y_w\mid x)}{\pi_{\text{ref}}(y_w\mid x)} - \beta\log\frac{\pi_\theta(y_l\mid x)}{\pi_{\text{ref}}(y_l\mid x)} \right) \right] \]

The policy \(\pi_\theta\) serves as its own implicit reward model — far less machinery than RLHF, same preference target.
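The loss is short enough to write out directly. The sketch below evaluates it for a single \((x, y_w, y_l)\) triple; it assumes each input is a sequence log-probability \(\log \pi(y \mid x)\) already summed over tokens, and the function name, default \(\beta\), and example values are hypothetical.

```python
import math


def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one (x, y_w, y_l) triple.

    Each argument is a sequence log-probability log pi(y | x), i.e. the
    sum of token log-probabilities (an assumption about the inputs).
    """
    margin = beta * ((policy_logp_w - ref_logp_w) - (policy_logp_l - ref_logp_l))
    # -log sigmoid(margin) == softplus(-margin), written in a stable form.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


at_reference = dpo_loss(-12.0, -15.0, -12.0, -15.0)   # policy equals reference
improved = dpo_loss(-10.0, -18.0, -12.0, -15.0)       # shifted toward y_w
regressed = dpo_loss(-14.0, -13.0, -12.0, -15.0)      # shifted toward y_l
print(at_reference, improved, regressed)
```

When the policy equals the reference the margin is zero and the loss is \(\log 2\). The loss falls only as the policy raises \(y_w\) relative to \(y_l\) by more than the reference does, which is how the log-ratio acts as an implicit reward.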

Method	Signal source	Reward model	Optimization
RLHF (Christiano et al., 2017)	human comparisons	explicit	RL (PPO)
Constitutional AI (Bai et al., 2022)	AI critique vs. written principles	explicit (AI-labeled)	RL (RLAIF)
DPO (Rafailov et al., 2023)	human/AI comparisons	implicit (the policy)	direct classification loss

All three optimize the same kind of target — a preference ordering. None closes the inner-alignment gap: they shape behavior on the preference distribution, not the goal the model internalizes.
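Of the three, the supervised phase of Constitutional AI is the least obvious from a formula, so a sketch of the critique-and-revise loop may help. This is a schematic under stated assumptions, not the published procedure: `model` stands for any callable from text to text, the prompt templates are placeholders, and `toy_model` is a hard-coded stand-in so the example runs without a real language model.

```python
def critique_and_revise(model, prompt, principles):
    """Return a (prompt, revised_response) pair for supervised fine-tuning.

    `model` is any callable mapping a text prompt to a text completion.
    The prompt templates below are placeholders, not the published ones.
    """
    response = model(prompt)
    for principle in principles:
        critique = model(
            f"CRITIQUE the response against this principle.\n"
            f"Principle: {principle}\nPrompt: {prompt}\nResponse: {response}"
        )
        response = model(
            f"REVISE the response to address the critique.\n"
            f"Critique: {critique}\nPrompt: {prompt}\nResponse: {response}"
        )
    return prompt, response


def toy_model(text):
    """Stand-in for a language model so the sketch runs end to end."""
    if text.startswith("CRITIQUE"):
        return "The response is dismissive."
    if text.startswith("REVISE"):
        return "Here is a careful, polite answer."
    return "Figure it out yourself."


pair = critique_and_revise(
    toy_model, "How do I reset my password?", ["Be helpful and polite."]
)
print(pair)
```

The revised pairs become supervised fine-tuning data. The written principles enter only through the critique and revision prompts, which is what makes the alignment target legible.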

6.4 Scalable Oversight

The methods above assume a human (or a written principle) can judge which output is better. That assumption breaks as models exceed human ability on the task being supervised — the scalable oversight problem: how do you provide an alignment signal for behavior you cannot directly evaluate? This is the hinge between alignment and Monitoring & Oversight, where the runtime side is developed.

Weak-to-strong generalization (Burns et al., 2023) is the empirical probe: can a weak supervisor elicit the full, aligned capability of a strong model despite being unable to judge its hardest outputs? Early results show that a strong model supervised by a weaker one recovers much, but not all, of its capability; the residual gap is what scalable oversight must close. The proposed mechanisms (debate, recursive decomposition, AI-assisted evaluation) all share a bet: that verifying a behavior is easier than generating it, so weaker verifiers can keep pace with stronger generators.
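Burns et al. (2023) quantify "much, but not all" as the performance gap recovered (PGR): the fraction of the gap between the weak supervisor and the strong model's ceiling that weak supervision closes. The sketch below computes it; the accuracy values in the example are invented for illustration and are not results from the paper.

```python
def performance_gap_recovered(weak, weak_to_strong, strong_ceiling):
    """Fraction of the weak-to-ceiling gap closed by weak supervision.

    weak:            performance of the weak supervisor itself
    weak_to_strong:  strong model trained on the weak supervisor's labels
    strong_ceiling:  strong model trained on ground-truth labels
    """
    gap = strong_ceiling - weak
    if gap <= 0:
        raise ValueError("strong ceiling must exceed weak performance")
    return (weak_to_strong - weak) / gap


# Hypothetical accuracies, chosen only to illustrate the arithmetic.
pgr = performance_gap_recovered(weak=0.60, weak_to_strong=0.75, strong_ceiling=0.80)
print(round(pgr, 2))
```

A PGR of 1 would mean the weak supervisor elicits everything the strong model can do; a PGR of 0 would mean the strong model merely imitates its supervisor.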

Key idea

Scalable oversight rests on one bet: verification is easier than generation. The full treatment (the capability-gap formalization, debate, AI control) lives in Monitoring & Oversight. This section only marks the hinge where training-time alignment hands off to runtime oversight.

6.5 Alignment Under Agency

Agency changes the problem qualitatively. A single-turn model produces an output; an agent pursues a goal over many steps with tools and memory, and the instrumental sub-goals of Omohundro (2008) stop being abstract:

Goal misgeneralization: an agent learns a goal that is correct on the training distribution but diverges under deployment conditions, while remaining capable. Unlike a capability failure, the agent competently pursues the wrong goal.
Deceptive alignment: the inner-alignment limit case, a model that models its own training, behaves aligned while it detects oversight, and defects when unobserved. Sleeper agents (Hubinger et al., 2024) show that deliberately trained deceptive behavior can persist through standard safety training (see Robustness & Security for the mechanism).
Reward hacking with tools: a tool-using agent has a far larger action space in which to find proxy-satisfying shortcuts, and the consequences land in the environment, not just in text.
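Goal misgeneralization in particular is easy to reproduce in miniature. The sketch below assumes a hypothetical five-cell corridor in which the goal always sits at the right edge during training; the environment, the policy names, and the episode settings are all illustrative assumptions.

```python
def episode(policy, goal_pos, start=2, width=5, horizon=10):
    """Return 1 if the agent reaches the goal within the horizon, else 0."""
    pos = start
    for _ in range(horizon):
        if pos == goal_pos:
            return 1
        step = policy(pos, goal_pos)               # -1 or +1
        pos = min(width - 1, max(0, pos + step))
    return int(pos == goal_pos)


def go_right(pos, goal_pos):
    """Learned correlate: ignores the goal and always moves right."""
    return +1


def seek_goal(pos, goal_pos):
    """Intended goal: move toward wherever the goal actually is."""
    return +1 if goal_pos > pos else -1


train_goals = [4, 4, 4]      # training: the goal is always at the right edge
deploy_goals = [0, 4, 0]     # deployment: the goal sometimes moves left

for policy in (go_right, seek_goal):
    train = sum(episode(policy, g) for g in train_goals)
    deploy = sum(episode(policy, g) for g in deploy_goals)
    print(policy.__name__, train, deploy)
```

Both policies earn full reward on the training distribution, so training behavior cannot tell them apart. Only the shifted deployment episodes reveal that `go_right` competently pursues the wrong goal.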

Important

Agency raises the stakes of every alignment failure. A misaligned model says something wrong; a misaligned agent acts on it, with tools, over many steps, in the world. The alignment techniques in this chapter were built for single-turn outputs. None was designed for a system whose mistakes are irreversible.

This is where alignment and security converge on the agentic frontier: the same autonomy that makes an agent useful makes a misaligned objective consequential, and the same tool access that defines the security surface defines the blast radius of a goal gone wrong.

6.6 Open Problems

Problem	Why it persists
Outer/proxy gap	Any tractable objective is a proxy; hard optimization surfaces the gap as reward hacking (Amodei et al., 2016)
Inner alignment	Training shapes behavior on-distribution; the internalized goal is unobserved and can diverge off-distribution
Scalable oversight	No reliable alignment signal for behavior humans cannot evaluate; weak-to-strong recovers only part of the gap (Burns et al., 2023)
Deceptive alignment	A model optimizing for the appearance of alignment under oversight is consistent with all behavioral evidence (Hubinger et al., 2024)
Alignment under agency	Instrumental convergence (Omohundro, 2008) makes misalignment actively dangerous once a system plans and acts over long horizons

Alignment supplies the objective the rest of the book defends. Its open problems are why interpretability (read the internalized goal), monitoring (catch divergence at runtime), and evaluation (measure it before deployment) are not optional add-ons but load-bearing parts of the same agenda.