4 The AI Safety & Security Landscape

The panoramic map of the field — read once to orient; topic chapters deepen each thread chronologically.

  • Core areas — five pillars and how they interlock.
  • Timeline — theory → the agentic frontier.
  • State of practice — what is deployed today.
  • Open problems — what motivates the rest of the book.

4.1 Core Areas

  1. Alignment — ensuring AI systems pursue the intended goals of their designers and users, avoiding unintended side effects or perverse instantiations.
  2. Robustness & Security — performing reliably under novel situations, adversarial attacks, and out-of-distribution inputs; defending the model and its surrounding system against exploitation.
  3. Interpretability & Explainability — understanding internal mechanisms so decision-making is transparent and auditable.
  4. Systemic Safety — broader impacts of deployment: cyber-security, socio-economic effects, and catastrophic risks.
  5. Monitoring & Control — overseeing deployed systems and intervening or shutting them down when behavior turns dangerous.

For the deeper lineage these areas grew from, see Historical Roots (1950–2015).

4.2 Timeline

The eras are threads, not a strict partition — they overlap and run in parallel (notably the safety and security lineages), which is why the field has more than a single line of descent.

Era | Period | Key Developments | Status
Foundations / Theory | Pre-2015 | Value alignment, inverse RL, reward specification | Foundational
Adversarial ML & Security | 2014–2019 | Adversarial examples, robustness, model extraction/inversion (the security lineage) | Foundational
Alignment Era | 2019–2022 | RLHF, RLAIF, Constitutional AI, fine-tuning at scale | Deployed in LLMs
Interpretability Surge | 2022–2024 | Mechanistic circuits, causal tracing; jailbreaks & prompt injection emerge | Research frontier
Scalable Oversight | 2023–2025 | Weak-to-strong generalization, AI-supervised evals; guardrail models & injection defenses | Emerging practice
Agentic Frontier | 2025+ | Agent evaluation, tool-use guardrails, multi-agent safety, agent attack surface | New frontier

4.3 State of Practice (2025)

  • RLHF & RLAIF — reinforcement learning from human/AI feedback is the standard for aligning LLMs with human preferences.
  • Constitutional AI — training models against an explicit set of principles, reducing reliance on extensive human labeling.
  • Mechanistic Interpretability — reverse-engineering networks to identify the circuits and features responsible for specific behaviors.
  • Scalable Oversight — supervising systems on tasks too complex for direct human evaluation (e.g., weaker models supervising stronger ones).
  • Automated Red Teaming — using AI to systematically probe for vulnerabilities, biases, and unsafe capabilities.
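To make the preference-learning step behind RLHF and RLAIF a little more concrete, here is a minimal Python sketch of the pairwise (Bradley-Terry) loss that is commonly used to fit a reward model from "chosen vs. rejected" comparisons. The function name `preference_loss` and the scalar reward inputs are assumptions for illustration; real pipelines compute this over batches of model outputs.

```python
import math


def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Pairwise loss -log(sigmoid(r_chosen - r_rejected)) for one comparison.

    A small loss means the reward model already ranks the preferred
    response above the rejected one by a comfortable margin.
    """
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(m)) equals softplus(-m); this form avoids overflow
    # for large positive or negative margins.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Hypothetical reward scores for two comparisons.
agrees_with_label = preference_loss(2.0, 0.0)     # model agrees with the label
disagrees_with_label = preference_loss(0.0, 2.0)  # model disagrees with the label
```

The loss is lower when the reward model agrees with the human (or AI) label, which is the signal that training on such comparisons pushes toward.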

4.4 The Field at a Glance

[Figure: The AI safety and security landscape: five pillars on shared foundations. Theoretical foundations: value alignment · inverse RL · adversarial ML. Pillars: Alignment (RLHF · DPO · Constitutional AI); Robustness (adversarial ML · injection defense); Interpretability (circuits · SAEs · probing); Monitoring (scalable oversight · red team · control); Systemic (governance · evals · catastrophic risk). Aligned & secure agentic systems.]

4.5 Open Problems & Risk Scenarios

4.5.1 Technical Challenges

  • Scalable Oversight — evaluating behavior on tasks humans cannot directly judge (scientific discovery, long-horizon planning).
  • Empirical Alignment — measuring genuine goal alignment vs. behavioral mimicry; avoiding reward hacking.
  • Robustness at Scale — preserving safety properties as capability and deployment surface grow.
  • Interpretability Trade-offs — deep understanding without sacrificing capability or inference speed.
  • Agentic Safety × Security — guardrails that survive tool-use, multi-step reasoning, and adversarial prompts.

4.5.2 Real-World Risk Scenarios

Scenario | Affected System | Core Area | Mitigation
Supply-chain automation agent misuses supplier data | Agentic systems | Alignment, Monitoring | Tool-use firewall, audit logs
LLM generates biased loan decisions at scale | Deployed LLM | Robustness, Systemic Safety | Red teaming, explainability
Prompt injection bypasses Constitutional AI rules | Fine-tuned model | Robustness, Alignment | Adversarial training, input filters
Multi-agent coordination yields emergent unsafe behavior | Multi-agent systems | Systemic Safety, Monitoring | Inter-agent oversight, rollback
Embodied robot receives contradictory commands | Robotic agent | Alignment, Monitoring & Control | Safety layer, human-override
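One way to read the "tool-use firewall, audit logs" mitigation in the supply-chain scenario is as an allowlist check placed in front of every tool call, with each decision recorded. The following is a minimal Python sketch under that assumption; `ToolFirewall`, the tool names, and the log fields are hypothetical and not taken from any specific agent framework.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class ToolFirewall:
    """Hypothetical allowlist gate placed in front of an agent's tool calls."""

    allowed_tools: set[str]
    audit_log: list[dict[str, Any]] = field(default_factory=list)

    def check(self, agent_id: str, tool: str, args: dict[str, Any]) -> bool:
        """Return True if the call may proceed; log the decision either way."""
        allowed = tool in self.allowed_tools
        self.audit_log.append(
            {"agent": agent_id, "tool": tool, "args": args, "allowed": allowed}
        )
        return allowed


# Illustrative use: the agent may read inventory but not export supplier data.
firewall = ToolFirewall(allowed_tools={"read_inventory"})
ok = firewall.check("supply-agent", "read_inventory", {"sku": "A1"})
blocked = firewall.check("supply-agent", "export_supplier_data", {"supplier": "S9"})
```

Denied calls are logged as well as allowed ones, so the audit log can show what the agent attempted and not only what it did. A production gate would likely also validate arguments and scope permissions per agent, which this sketch leaves out.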

4.6 Governance & Compliance

4.6.1 Regulatory Context

  • EU AI Act — risk-based classification; guardrails required for “high-risk” systems.
  • US executive action — EO 14110 (2023) set safety-evaluation and red-teaming guidance. [Updated 2026: EO 14110 was rescinded in Jan 2025; US posture shifted toward the NIST/CAISI (Center for AI Standards and Innovation) standards track.]
  • NIST AI RMF (+ Generative AI Profile, AI 600-1) — managing AI risk across the lifecycle.
  • Frontier safety frameworks — lab-side: Anthropic Responsible Scaling Policy, OpenAI Preparedness Framework, DeepMind Frontier Safety Framework.
  • Corporate Governance — internal safety reviews, model/system cards, pre-submission gates.

4.6.2 Practices for Practitioners

  1. Evaluation — use multiple benchmarks (not just task accuracy; include robustness, fairness, interpretability).
  2. Red Teaming — systematically probe failure modes; document adversarial examples.
  3. Audit Trail — log decisions, guardrail triggers, human interventions.
  4. Transparency — document assumptions, limitations, regulatory constraints.
  5. Continuous Monitoring — track safety metrics post-deployment; escalate anomalies.
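Practices 3 and 5 can be illustrated with a small Python sketch: a JSON-lines audit record and a rolling monitor that flags when guardrail triggers become unusually frequent. The names `audit_record` and `GuardrailMonitor`, the field names, and the threshold values are all assumptions chosen for illustration; suitable windows and thresholds depend on the deployment.

```python
import json
from collections import deque
from datetime import datetime, timezone


def audit_record(event: str, detail: dict) -> str:
    """Serialize one audit event as a JSON line with a UTC timestamp."""
    entry = {"ts": datetime.now(timezone.utc).isoformat(), "event": event, **detail}
    return json.dumps(entry, sort_keys=True)


class GuardrailMonitor:
    """Flags an anomaly when the rolling guardrail-trigger rate is too high."""

    def __init__(self, window: int = 100, max_trigger_rate: float = 0.05,
                 min_samples: int = 20):
        self.outcomes = deque(maxlen=window)  # most recent trigger flags
        self.max_trigger_rate = max_trigger_rate
        self.min_samples = min_samples        # avoid alarms on tiny samples

    def observe(self, triggered: bool) -> bool:
        """Record one request; return True if the case should be escalated."""
        self.outcomes.append(triggered)
        rate = sum(self.outcomes) / len(self.outcomes)
        return len(self.outcomes) >= self.min_samples and rate > self.max_trigger_rate


# Illustrative use with hypothetical event and rule names.
line = audit_record("guardrail_trigger", {"rule": "pii_filter", "action": "blocked"})
monitor = GuardrailMonitor(window=10, max_trigger_rate=0.2, min_samples=5)
escalations = [monitor.observe(flag) for flag in [False] * 5 + [True, True]]
```

In this toy run the monitor stays quiet until the second trigger pushes the rolling rate above the threshold; in practice the escalation itself would be written to the same audit trail, alongside any human intervention that follows.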