4 The AI Safety & Security Landscape
The panoramic map of the field — read once to orient; topic chapters deepen each thread chronologically.
- Core areas — five pillars and how they interlock.
- Timeline — theory → the agentic frontier.
- State of practice — what is deployed today.
- Open problems — what motivates the rest of the book.
4.1 Core Areas
- Alignment — ensuring AI systems pursue the intended goals of their designers and users, avoiding unintended side effects or perverse instantiations.
- Robustness & Security — performing reliably under novel situations, adversarial attacks, and out-of-distribution inputs; defending the model and its surrounding system against exploitation.
- Interpretability & Explainability — understanding internal mechanisms so decision-making is transparent and auditable.
- Systemic Safety — broader impacts of deployment: cyber-security, socio-economic effects, and catastrophic risks.
- Monitoring & Control — overseeing deployed systems and intervening or shutting them down when behavior turns dangerous.
For the deeper lineage these areas grew from, see Historical Roots (1950–2015).
4.2 Timeline
The eras are threads, not a strict partition — they overlap and run in parallel (notably the safety and security lineages), which is why the field has more than a single line of descent.
| Era | Period | Key Developments | Status |
|---|---|---|---|
| Foundations / Theory | Pre-2015 | Value alignment, inverse RL, reward specification | Foundational |
| Adversarial ML & Security | 2014–2019 | Adversarial examples, robustness, model extraction/inversion — the security lineage | Foundational |
| Alignment Era | 2019–2022 | RLHF, RLAIF, Constitutional AI, fine-tuning at scale | Deployed in LLMs |
| Interpretability Surge | 2022–2024 | Mechanistic circuits, causal tracing; jailbreaks & prompt injection emerge | Research frontier |
| Scalable Oversight | 2023–2025 | Weak-to-strong generalization, AI-supervised evals; guardrail models & injection defenses | Emerging practice |
| Agentic Frontier | 2025+ | Agent evaluation, tool-use guardrails, multi-agent safety, agent attack surface | New frontier |
4.3 State of Practice (2025)
- RLHF & RLAIF — reinforcement learning from human/AI feedback is the standard for aligning LLMs with human preferences.
- Constitutional AI — training models against an explicit set of principles, reducing reliance on extensive human labeling.
- Mechanistic Interpretability — reverse-engineering networks to identify the circuits and features responsible for specific behaviors.
- Scalable Oversight — supervising systems on tasks too complex for direct human evaluation (e.g., weaker models supervising stronger ones).
- Automated Red Teaming — using AI to systematically probe for vulnerabilities, biases, and unsafe capabilities.
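As a rough illustration of the RLHF bullet above, one commonly used formulation (not necessarily what any particular system uses) first fits a reward model to pairwise preferences and then optimizes the policy against that reward, with a KL penalty that keeps it close to a reference model. The symbols here are notation introduced for this sketch, not taken from the chapter: $r_\phi$ is the reward model, $\pi_\theta$ the policy, $\pi_{\mathrm{ref}}$ the reference model, $\beta$ the penalty weight, and $(x, y_w, y_l)$ a prompt with a preferred and a rejected response. In RLAIF the preference labels come from an AI labeler rather than from humans, and the objective keeps the same form.

```latex
% Reward model: Bradley-Terry loss over preference pairs (y_w preferred to y_l)
\mathcal{L}_{\mathrm{RM}}(\phi) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}
  \left[\log \sigma\!\left(r_\phi(x, y_w) - r_\phi(x, y_l)\right)\right]

% Policy: maximize learned reward while staying close to the reference model
\max_{\theta}\; \mathbb{E}_{x\sim\mathcal{D},\; y\sim\pi_\theta(\cdot\mid x)}
  \left[ r_\phi(x, y) \right]
  \;-\; \beta\, \mathbb{E}_{x\sim\mathcal{D}}
  \left[\mathrm{KL}\!\left(\pi_\theta(\cdot\mid x)\,\|\,\pi_{\mathrm{ref}}(\cdot\mid x)\right)\right]
```

The KL term is what links this bullet to the reward-hacking concern raised later in the chapter: with a small $\beta$, the policy is freer to exploit errors in $r_\phi$.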
4.4 The Field at a Glance
4.5 Open Problems & Risk Scenarios
4.5.1 Technical Challenges
- Scalable Oversight — evaluating behavior on tasks humans cannot directly judge (scientific discovery, long-horizon planning).
- Empirical Alignment — measuring genuine goal alignment vs. behavioral mimicry; avoiding reward hacking.
- Robustness at Scale — preserving safety properties as capability and deployment surface grow.
- Interpretability Trade-offs — deep understanding without sacrificing capability or inference speed.
- Agentic Safety × Security — guardrails that survive tool-use, multi-step reasoning, and adversarial prompts.
4.5.2 Real-World Risk Scenarios
| Scenario | Affected System | Core Area | Mitigation |
|---|---|---|---|
| Supply-chain automation agent misuses supplier data | Agentic systems | Alignment, Monitoring | Tool-use firewall, audit logs |
| LLM generates biased loan decisions at scale | Deployed LLM | Robustness, Systemic Safety | Red teaming, explainability |
| Prompt injection bypasses Constitutional AI rules | Fine-tuned model | Robustness, Alignment | Adversarial training, input filters |
| Multi-agent coordination yields emergent unsafe behavior | Multi-agent systems | Systemic Safety, Monitoring | Inter-agent oversight, rollback |
| Embodied robot receives contradictory commands | Robotic agent | Alignment, Monitoring & Control | Safety layer, human override |
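To make the "tool-use firewall, audit logs" mitigation in the first row more concrete, here is a minimal sketch in Python. It assumes the simplest possible design: an allowlist of tools plus a denylist of sensitive argument names, with every attempt logged. The names `ToolFirewall`, `blocked_arg_keys`, `supplier_bank_account`, and `lookup_price` are hypothetical and chosen for illustration; a production system would likely need richer policies (per-user scopes, argument value checks, rate limits).

```python
import time
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class ToolFirewall:
    """Allowlist-based gate between an agent and its tools (illustrative only)."""

    allowed_tools: dict[str, Callable[..., Any]]
    # Hypothetical argument names the agent must never pass to any tool.
    blocked_arg_keys: frozenset[str] = frozenset({"supplier_bank_account"})
    audit_log: list[dict] = field(default_factory=list)

    def call(self, tool_name: str, **kwargs: Any) -> Any:
        decision = "allow"
        if tool_name not in self.allowed_tools:
            decision = "deny:unknown_tool"
        elif self.blocked_arg_keys.intersection(kwargs):
            decision = "deny:sensitive_argument"

        # Every attempt is logged, whether or not it is allowed.
        # Only argument names are recorded, not values, to keep the log low-risk.
        self.audit_log.append(
            {
                "ts": time.time(),
                "tool": tool_name,
                "arg_names": sorted(kwargs),
                "decision": decision,
            }
        )

        if decision != "allow":
            raise PermissionError(decision)
        return self.allowed_tools[tool_name](**kwargs)
```

In use, the agent would route every tool invocation through `call`, for example `ToolFirewall(allowed_tools={"lookup_price": lookup_price}).call("lookup_price", sku="A1")`. A denied call raises `PermissionError` and still leaves an entry in `audit_log`, so reviewers can see what the agent attempted.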
4.6 Governance & Compliance
4.6.1 Regulatory Context
- EU AI Act — risk-based classification; guardrails required for “high-risk” systems.
- US executive action — EO 14110 (2023) set safety-evaluation and red-teaming guidance. [Updated 2026: EO 14110 was rescinded in Jan 2025; US posture shifted toward the NIST/CAISI (Center for AI Standards and Innovation) standards track.]
- NIST AI RMF (+ Generative AI Profile, AI 600-1) — managing AI risk across the lifecycle.
- Frontier safety frameworks — lab-side: Anthropic Responsible Scaling Policy, OpenAI Preparedness Framework, DeepMind Frontier Safety Framework.
- Corporate Governance — internal safety reviews, model/system cards, pre-deployment gates.
4.6.2 Practices for Practitioners
- Evaluation — evaluate on multiple benchmarks that cover robustness, fairness, and interpretability, not task accuracy alone.
- Red Teaming — systematically probe failure modes; document adversarial examples.
- Audit Trail — log decisions, guardrail triggers, human interventions.
- Transparency — document assumptions, limitations, regulatory constraints.
- Continuous Monitoring — track safety metrics post-deployment; escalate anomalies.
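The continuous-monitoring practice above can be sketched as a rolling-window check on one safety metric. This example assumes the metric is the fraction of recent requests that tripped a guardrail, and that "escalate" simply means returning `True` to the caller. The class name `GuardrailRateMonitor` and the default window and threshold values are illustrative assumptions, not recommendations.

```python
from collections import deque


class GuardrailRateMonitor:
    """Rolling-window monitor for the fraction of requests that trip a guardrail."""

    def __init__(self, window: int = 100, threshold: float = 0.05) -> None:
        # Oldest events drop off automatically once the window is full.
        self.events: deque[bool] = deque(maxlen=window)
        self.threshold = threshold

    def record(self, guardrail_triggered: bool) -> bool:
        """Record one request; return True if the rate should be escalated."""
        self.events.append(guardrail_triggered)
        # Only judge once the window is full, to avoid noisy early alerts.
        if len(self.events) < self.events.maxlen:
            return False
        rate = sum(self.events) / len(self.events)
        return rate > self.threshold


if __name__ == "__main__":
    monitor = GuardrailRateMonitor(window=4, threshold=0.25)
    for triggered in [False, True, False, False, True, True]:
        if monitor.record(triggered):
            print("escalate: guardrail trigger rate above threshold")
```

A fixed threshold is the simplest choice. Teams that see strong daily or weekly traffic patterns may prefer to compare against a baseline rate instead, at the cost of extra state to maintain.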