3 The AI Safety & Security Landscape

TL;DR

The field rests on five pillars: alignment, robustness & security, interpretability, systemic safety, monitoring & control.
Its history runs theory → adversarial ML → alignment → interpretability → oversight → the agentic frontier.
The eras are threads, not a partition: the safety and security lineages run in parallel.
Everything deployed today (RLHF, Constitutional AI, guardrails, red-teaming) is a partial mitigation; the open problems are what motivate the rest of the book.

This chapter is the panoramic map of the field: read it once to orient, then turn to the topic chapters, which deepen each thread chronologically.

Core areas — five pillars and how they interlock.
Timeline — theory → the agentic frontier.
State of practice — what is deployed today.
Open problems — what motivates the rest of the book.

3.1 Core Areas

Alignment — ensuring AI systems pursue the intended goals of their designers and users, avoiding unintended side effects or perverse instantiations.
Robustness & Security — performing reliably under novel situations, adversarial attacks, and out-of-distribution inputs; defending the model and its surrounding system against exploitation.
Interpretability & Explainability — understanding internal mechanisms so decision-making is transparent and auditable.
Systemic Safety — broader impacts of deployment: cyber-security, socio-economic effects, and catastrophic risks.
Monitoring & Control — overseeing deployed systems and intervening or shutting them down when behavior turns dangerous.

Key idea

These five pillars are the book’s spine. Each gets a Part II chapter, and the flagship Agentic Safety × Security chapter is where they converge, because an agent can fail across all five at once rather than one at a time.

For the deeper lineage these areas grew from, see Historical Roots (1950–2015).

3.2 Timeline

The eras are threads, not a strict partition: they overlap in time and run in parallel (notably the safety and security lineages), so the field has more than one line of descent and the periods in the table below overlap.

Era	Period	Key Developments	Status
Foundations / Theory	Pre-2015	Value alignment, inverse RL, reward specification	Foundational
Adversarial ML & Security	2014–2019	Adversarial examples, robustness, model extraction/inversion — the security lineage	Foundational
Alignment Era	2019–2022	RLHF, RLAIF, Constitutional AI, fine-tuning at scale	Deployed in LLMs
Interpretability Surge	2022–2024	Mechanistic circuits, causal tracing; jailbreaks & prompt injection emerge	Research frontier
Scalable Oversight	2023–2025	Weak-to-strong generalization, AI-supervised evals; guardrail models & injection defenses	Emerging practice
Agentic Frontier	2025+	Agent evaluation, tool-use guardrails, multi-agent safety, agent attack surface	New frontier

3.3 State of Practice (2025)

Time-sensitive

The state of practice below reflects the last review. Deployed techniques turn over on 12–24 month cycles, so treat the list as a snapshot rather than a standing claim.

RLHF & RLAIF — reinforcement learning from human/AI feedback is the standard for aligning LLMs with human preferences.
Constitutional AI — training models against an explicit set of principles, reducing reliance on extensive human labeling.
Mechanistic Interpretability — reverse-engineering networks to identify the circuits and features responsible for specific behaviors.
Scalable Oversight — supervising systems on tasks too complex for direct human evaluation (e.g., weaker models supervising stronger ones).
Automated Red Teaming — using AI to systematically probe for vulnerabilities, biases, and unsafe capabilities.
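
To make the RLHF entry above concrete, here is a minimal sketch of the pairwise preference loss commonly used to train a reward model from human comparisons (the Bradley-Terry form). The chapter does not say which objective any particular system uses, so treat this as one illustrative formulation. The function names and the scalar reward inputs are assumptions made for the example.

```python
import math


def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Pairwise loss -log(sigmoid(r_chosen - r_rejected)).

    Illustrative only: rewards are plain floats here, whereas a real
    reward model would produce them from (prompt, response) pairs.
    """
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch for numerical stability.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def mean_preference_loss(pairs: list[tuple[float, float]]) -> float:
    """Average the pairwise loss over (chosen, rejected) reward pairs."""
    return sum(preference_loss(c, r) for c, r in pairs) / len(pairs)
```

The loss is small when the reward model scores the human-preferred response well above the rejected one, and grows roughly linearly when the ranking is inverted. That pressure is what pushes the learned reward toward the labelers' preferences.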

3.4 The field at a glance

3.5 Open Problems & Risk Scenarios

3.5.1 Technical Challenges

Scalable Oversight — evaluating behavior on tasks humans cannot directly judge (scientific discovery, long-horizon planning).
Empirical Alignment — measuring genuine goal alignment vs. behavioral mimicry; avoiding reward hacking.
Robustness at Scale — preserving safety properties as capability and deployment surface grow.
Interpretability Trade-offs — deep understanding without sacrificing capability or inference speed.
Agentic Safety × Security — guardrails that survive tool-use, multi-step reasoning, and adversarial prompts.

3.5.2 Real-World Risk Scenarios

Scenario	Affected System	Core Area	Mitigation
Supply-chain automation agent misuses supplier data	Agentic systems	Alignment, Monitoring	Tool-use firewall, audit logs
LLM generates biased loan decisions at scale	Deployed LLM	Robustness, Systemic Safety	Red teaming, explainability
Prompt injection bypasses Constitutional AI rules	Fine-tuned model	Robustness, Alignment	Adversarial training, input filters
Multi-agent coordination yields emergent unsafe behavior	Multi-agent systems	Systemic Safety, Monitoring	Inter-agent oversight, rollback
Embodied robot receives contradictory commands	Robotic agent	Alignment, Monitoring & Control	Safety layer, human-override
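
The "tool-use firewall, audit logs" mitigation in the first row can be sketched as an allowlist check that records every decision. This is a hypothetical minimal design, not a reference to any specific product. The class name `ToolFirewall`, its fields, and the tool names in the usage lines are all assumptions introduced for illustration.

```python
import json
import time
from dataclasses import dataclass, field


@dataclass
class ToolFirewall:
    """Hypothetical allowlist firewall for agent tool calls."""

    allowed_tools: set
    audit_log: list = field(default_factory=list)

    def check(self, agent_id: str, tool: str, args: dict) -> bool:
        """Return True if the call is allowed; log the decision either way."""
        allowed = tool in self.allowed_tools
        self.audit_log.append(
            {
                "ts": time.time(),
                "agent": agent_id,
                "tool": tool,
                "args": args,
                "decision": "allow" if allowed else "deny",
            }
        )
        return allowed

    def export_log(self) -> str:
        """Serialize the audit log as JSON Lines for later review."""
        return "\n".join(json.dumps(e, sort_keys=True) for e in self.audit_log)


# Usage with made-up tool names: denied calls are still recorded.
fw = ToolFirewall(allowed_tools={"read_catalog", "create_purchase_order"})
fw.check("procurement-agent", "read_catalog", {"sku": "A-100"})
fw.check("procurement-agent", "export_supplier_contacts", {})
```

Logging denials as well as approvals is the design choice that matters here: a blocked attempt to reach supplier data is exactly the signal a reviewer would want to see.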

3.6 Governance & Compliance

3.6.1 Regulatory Context

Time-sensitive

Regulation moves faster than the publication cycle. Entries below reflect status as of last review, so verify against primary sources before relying on them. Updates are tracked in Systemic Safety & Governance.

EU AI Act — risk-based classification; “high-risk” systems carry obligations including risk management, logging, transparency, and human oversight.
US executive action — EO 14110 (2023) set safety-evaluation and red-teaming guidance; it was rescinded in January 2025, and the US posture has since shifted toward the NIST/CAISI (Center for AI Standards and Innovation) standards track. [Updated 2026]
NIST AI RMF (+ Generative AI Profile, AI 600-1) — managing AI risk across the lifecycle.
Frontier safety frameworks — lab-side: Anthropic Responsible Scaling Policy, OpenAI Preparedness Framework, DeepMind Frontier Safety Framework.
Corporate Governance — internal safety reviews, model/system cards, pre-submission gates.

3.6.2 Practices for Practitioners

Evaluation — use multiple benchmarks (not just task accuracy; include robustness, fairness, interpretability).
Red Teaming — systematically probe failure modes; document adversarial examples.
Audit Trail — log decisions, guardrail triggers, human interventions.
Transparency — document assumptions, limitations, regulatory constraints.
Continuous Monitoring — track safety metrics post-deployment; escalate anomalies.
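
As a rough illustration of the continuous-monitoring practice above, the sketch below tracks the rate of guardrail triggers over a rolling window and signals escalation when it crosses a threshold. The class name, the default window size, and the default threshold are assumptions chosen for the example, not recommended values.

```python
from collections import deque


class GuardrailRateMonitor:
    """Hypothetical rolling-window monitor for guardrail trigger rate."""

    def __init__(self, window: int = 100, threshold: float = 0.05):
        # Keep only the most recent `window` observations.
        self.events = deque(maxlen=window)
        self.threshold = threshold

    def record(self, triggered: bool) -> bool:
        """Record one request; return True if escalation is warranted.

        Escalation is signalled only once the window is full, so a single
        early trigger does not raise an alert on its own.
        """
        self.events.append(1 if triggered else 0)
        window_full = len(self.events) == self.events.maxlen
        rate = sum(self.events) / len(self.events)
        return window_full and rate > self.threshold
```

In practice the returned flag would feed whatever escalation path the team already uses (paging, ticketing, or an automatic rollback), and each escalation would itself be written to the audit trail.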