12 Evaluation & Benchmarks

Stub. Planned as a chronological deep-dive: static accuracy benchmarks → holistic evaluation (HELM) → safety and red-teaming evals → agentic benchmarks (AgentDojo, τ-bench, Agent Security Bench) → dangerous-capability evaluations (CBRN, cyber, autonomy, self-proliferation) and the frontier-policy thresholds they trigger. This section is the methodological backbone of the evaluation review paper.