11 Evaluation & Benchmarks

TL;DR

Evaluation measures what a system can do, and what it can be made to do.
Every benchmark generation breaks on contamination, construct validity, or Goodhart’s law, which is what motivates the next one.
The arc runs static → holistic → adversarial → agentic → dangerous-capability.
Agentic evaluation is a trajectory property, not an output property, so single-turn scores miss it entirely.
In dangerous-capability evals, a positive result is a warning, not an achievement.
Benchmark tables live in the catalog; this chapter is the framework.

Evaluation determines what deployed AI systems can and cannot do — including what they can and cannot be made to do. It is not a single methodology but a research discipline that has passed through distinct generations, each revealing blind spots in the generation before it.

11.1 Why Evaluation is a Discipline

A benchmark is a claim about what ability a score measures. That claim fails in at least three ways:

Contamination — test examples appear in training data; scores reflect memorization, not generalization.
Construct validity — the benchmark measures a proxy that diverges from the target ability under distribution shift or capability scaling.
Goodhart’s law — once a benchmark becomes a training target, it ceases to measure what it was designed to measure.

Each benchmark generation has broken on one of these axes, motivating the next. The pattern is: release → models saturate → contamination or Goodhart effects are discovered → new benchmark. The cycle then restarts, typically within 18–24 months.

Key idea

A benchmark is a claim, not a measurement. The score is only as good as the argument that it measures the ability you care about, and that argument decays as models train on the benchmark, saturate it, or learn the proxy instead of the target. Evaluation is a discipline because the measuring instrument is adversarially coupled to the thing being measured.
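The proxy-versus-target gap can be made concrete with a toy simulation. The sketch below is an illustration under stated assumptions, not a model of any real benchmark: each candidate has a hidden true ability, the benchmark reports that ability plus Gaussian noise, and the "winner" is whoever scores highest on the benchmark. The function name `goodhart_gap` and all parameter values are hypothetical.

```python
import random
import statistics


def goodhart_gap(num_candidates=20, trials=2000, noise_sd=1.0, seed=0):
    """Return the mean proxy score and mean true ability of proxy-selected winners."""
    rng = random.Random(seed)
    chosen_proxy, chosen_true = [], []
    for _ in range(trials):
        candidates = []
        for _ in range(num_candidates):
            # Assumption: the proxy is the true ability plus independent noise.
            true_ability = rng.gauss(0.0, 1.0)
            proxy_score = true_ability + rng.gauss(0.0, noise_sd)
            candidates.append((proxy_score, true_ability))
        # Optimizing on the benchmark means selecting the best proxy score.
        best_proxy, best_true = max(candidates)
        chosen_proxy.append(best_proxy)
        chosen_true.append(best_true)
    return statistics.mean(chosen_proxy), statistics.mean(chosen_true)


proxy_mean, true_mean = goodhart_gap()
print(f"winner's benchmark score: {proxy_mean:.2f}, winner's true ability: {true_mean:.2f}")
```

Selection on the proxy systematically overstates the winner's true ability, even though no individual score was tampered with. This is only the mildest, purely statistical form of the effect; training directly on the benchmark can widen the gap further.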

Five generations of AI evaluation — each generation exposed blind spots in the one before it.

11.2 Generation 1 — Static Accuracy (2018–2022)

Fixed test sets, accuracy over labeled examples — GLUE, MMLU, HumanEval, BIG-bench. Simple, reproducible, and short-lived: each broke on contamination or Goodhart’s law within 18–24 months of release. None measured safety behavior, adversarial robustness, or agentic capability. Detailed benchmark table in the supplement.
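A Generation 1 score is a single proportion over a fixed test set, so it is worth reporting with an uncertainty interval. The following is a minimal sketch using the Wilson score interval; the function name and the toy data are assumptions for illustration and do not come from any of the benchmarks named above.

```python
import math


def accuracy_with_wilson_interval(predictions, labels, z=1.96):
    """Accuracy over a fixed test set with a Wilson score interval (z=1.96 ~ 95%)."""
    if len(predictions) != len(labels) or not labels:
        raise ValueError("predictions and labels must be non-empty and the same length")
    n = len(labels)
    correct = sum(1 for p, y in zip(predictions, labels) if p == y)
    p_hat = correct / n
    denom = 1 + z ** 2 / n
    center = (p_hat + z ** 2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z ** 2 / (4 * n ** 2)) / denom
    return p_hat, center - half_width, center + half_width


# Toy data: 80 of 100 answers correct.
toy_predictions = [1] * 80 + [0] * 20
toy_labels = [1] * 100
acc, low, high = accuracy_with_wilson_interval(toy_predictions, toy_labels)
print(f"accuracy={acc:.2f}, 95% interval=({low:.3f}, {high:.3f})")
```

The interval only captures sampling noise from the finite test set. It says nothing about contamination or construct validity, which is why a tight interval on a saturated benchmark can still be uninformative.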

Pitfall

A saturated benchmark measures contamination, not capability. Once a benchmark is public long enough to enter training corpora, a rising score can mean memorization rather than ability, and there is no way to tell from the score alone. Any leaderboard number without a contamination argument attached is uninterpretable.
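One common ingredient of a contamination argument is an n-gram overlap check between test items and whatever training text is available for inspection. The sketch below is a rough heuristic under stated assumptions: whitespace tokenization, exact matching, and a caller-supplied corpus sample. The function names and the default `n=8` are illustrative choices, not a standard.

```python
def ngrams(text, n):
    """Set of lowercase whitespace-token n-grams in a string."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_fraction(test_example, corpus_docs, n=8):
    """Fraction of the test example's n-grams that also occur in the corpus sample."""
    example_grams = ngrams(test_example, n)
    if not example_grams:
        return 0.0  # example shorter than n tokens: nothing to compare
    corpus_grams = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)
    return len(example_grams & corpus_grams) / len(example_grams)


# Toy corpus sample (invented text); n=3 because the strings are short.
corpus_sample = ["quiz: what is the capital of france answer paris"]
leaked = overlap_fraction("What is the capital of France", corpus_sample, n=3)
fresh = overlap_fraction("Name the largest moon of Saturn", corpus_sample, n=3)
print(leaked, fresh)
```

High overlap is evidence of leakage, but low overlap is not evidence of a clean benchmark: paraphrased or translated test items pass this check, and the training corpus of a frontier model is usually not available to search.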

11.3 Generation 2 — Holistic Evaluation (2022–2023)

HELM (Liang et al., 2023) was among the first multi-dimensional frameworks: accuracy, calibration, robustness, fairness, bias, toxicity, and efficiency scored simultaneously across shared scenarios (metric breakdown in the supplement). The key lesson: high accuracy does not guarantee robustness. A model topping accuracy leaderboards can still be brittle to distribution shift or rephrasing.
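To see why scoring several dimensions on the same predictions matters, the sketch below computes accuracy and expected calibration error (ECE) side by side. It is a minimal illustration with equal-width confidence bins, not HELM's implementation; the function name and toy inputs are assumptions.

```python
def accuracy_and_ece(confidences, correct, num_bins=10):
    """Accuracy and expected calibration error using equal-width confidence bins."""
    n = len(confidences)
    if n == 0 or n != len(correct):
        raise ValueError("inputs must be non-empty and the same length")
    bins = [[] for _ in range(num_bins)]
    for conf, is_correct in zip(confidences, correct):
        index = min(int(conf * num_bins), num_bins - 1)  # conf == 1.0 goes in the last bin
        bins[index].append((conf, is_correct))
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        avg_acc = sum(1 for _, ok in bucket if ok) / len(bucket)
        # Weight each bin's confidence/accuracy gap by its share of examples.
        ece += len(bucket) / n * abs(avg_conf - avg_acc)
    accuracy = sum(1 for ok in correct if ok) / n
    return accuracy, ece


# Two hypothetical models with identical answers but different stated confidence.
outcomes = [True] * 8 + [False] * 2
calibrated = accuracy_and_ece([0.8] * 10, outcomes)
overconfident = accuracy_and_ece([0.99] * 10, outcomes)
print(calibrated, overconfident)
```

Both toy models have accuracy 0.8, so an accuracy-only leaderboard cannot separate them, while the calibration axis does (ECE near 0 versus about 0.19).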

Red-teaming (Perez et al., 2022) emerged in parallel: use a language model to generate adversarial prompts and measure harmful-output rate. Because the prompts are generated automatically, an LLM red team can probe far more of the attack surface than manual testers can write by hand.
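The measurement side of this setup reduces to a rate over generated prompts. The sketch below shows only that bookkeeping: `target` and `is_harmful` are hypothetical callables standing in for the model under test and a harm classifier, and the toy stand-ins at the bottom are invented so the example runs. It is not the pipeline from the cited paper.

```python
from typing import Callable, Iterable, List, Tuple


def harmful_output_rate(
    prompts: Iterable[str],
    target: Callable[[str], str],
    is_harmful: Callable[[str], bool],
) -> Tuple[float, List[str]]:
    """Fraction of prompts whose response is flagged, plus the flagged prompts."""
    prompt_list = list(prompts)
    if not prompt_list:
        raise ValueError("need at least one prompt")
    flagged = [p for p in prompt_list if is_harmful(target(p))]
    return len(flagged) / len(prompt_list), flagged


# Toy stand-ins (assumptions): a "model" that misbehaves on one keyword,
# and a "classifier" that recognizes the resulting marker string.
def toy_target(prompt: str) -> str:
    return "UNSAFE" if "trigger" in prompt else "I can't help with that."


def toy_classifier(response: str) -> bool:
    return response == "UNSAFE"


rate, failures = harmful_output_rate(
    ["a", "b trigger", "c", "d trigger"], toy_target, toy_classifier
)
print(rate, failures)
```

The reported rate inherits every error of the classifier, so in practice the classifier needs its own validation before the number means anything.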

Both measure behavioral properties of a fixed model under fixed prompts — neither captures what a determined adversary can elicit using optimization, nor what happens over a multi-step task.

11.4 Generation 3 — Adversarial Benchmarks (2023–2024)

GCG (Greedy Coordinate Gradient; Zou et al., 2023) changed the evaluation calculus: if a fixed adversarial suffix can reliably bypass alignment training, behavioral red-teaming understates the attack surface. The response: benchmarks that stress-test safety under optimization, with attack success rate reported as a function of attack budget.

HarmBench (Mazeika et al., 2024) standardizes this across 18 red-teaming methods and 33 target LLMs:

HarmBench — 18 attack methods × 5 harm categories × 33 target models; attack success rate reported as a function of attack budget.

Key finding: under optimization-based attacks, current safety training provides substantially weaker protection than behavioral evaluations suggest. The gap between behavioral safety scores and adversarial safety scores is a measure of how much the behavioral evaluation understates actual risk.
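Reporting attack success rate (ASR) as a function of budget can be sketched as follows. The input format is an assumption made for illustration: one entry per test behavior, holding the number of attack steps needed for the first success, or `None` if the attack never succeeded. The numbers are invented and are not HarmBench results.

```python
def asr_by_budget(steps_to_first_success, budgets):
    """Map each budget to the fraction of behaviors broken within that many steps."""
    total = len(steps_to_first_success)
    if total == 0:
        raise ValueError("need at least one behavior")
    curve = {}
    for budget in budgets:
        broken = sum(
            1 for steps in steps_to_first_success
            if steps is not None and steps <= budget
        )
        curve[budget] = broken / total
    return curve


# Invented records for ten behaviors; None means the attack never succeeded.
records = [1, 5, 50, None, 200, None, 10, 500, None, 2]
curve = asr_by_budget(records, budgets=[1, 10, 100, 1000])

# Budget 1 plays the role of a single-shot behavioral evaluation.
behavioral_vs_adversarial_gap = curve[1000] - curve[1]
print(curve, behavioral_vs_adversarial_gap)
```

In this toy data the single-shot rate is 0.1 while the rate at a budget of 1000 steps is 0.7. The difference is the amount by which a fixed-prompt evaluation would understate risk against an attacker with that budget.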

HarmBench also introduced an adversarial training method that materially improves robustness across diverse attack vectors — the first standardized defense evaluation in the same framework as attack evaluation.

The generation 3 insight: jailbreak resistance ≠ distribution-shift robustness. A model can be robust to out-of-distribution (OOD) inputs while being brittle to adversarial suffixes. These are distinct threat models; conflating them produces benchmarks that measure neither well.

11.5 Generation 4 — Agentic Benchmarks (2024–present)

Single-turn benchmarks miss the failure modes that emerge when a model takes actions in a multi-step loop. Agentic evaluation requires three things single-turn evaluation lacks: a task environment, trajectory-level safety criteria, and adversarial conditions embedded in the environment — not just in the prompt.

Agentic benchmark structure — legitimate tasks and adversarial payloads arrive simultaneously; both utility and safety are measured.

Key Gen 4 findings across AgentDojo (Debenedetti et al., 2024), Agent Security Bench (Zhang et al., 2025), and service-task reliability benchmarks (τ-bench): frontier models complete 60–70% of legitimate tasks but are vulnerable to environmental injection in 30–60% of attack scenarios; direct injection achieves up to 84% attack success on undefended agents; and no evaluated defense eliminated the attack surface without also degrading task completion. Detailed benchmark numbers in the supplement.
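Measuring utility and safety together means scoring every episode on both axes. The sketch below is one way to tabulate that, assuming a hypothetical `Episode` record with three boolean fields; the metric names are illustrative labels, and the toy episodes are invented rather than taken from any benchmark.

```python
from dataclasses import dataclass


@dataclass
class Episode:
    task_completed: bool           # did the agent finish the legitimate task?
    attack_present: bool           # was an adversarial payload in the environment?
    attack_succeeded: bool = False # did the agent carry out the attacker's goal?


def dual_axis_report(episodes):
    """Utility and safety rates, reported separately for benign and attacked runs."""
    benign = [e for e in episodes if not e.attack_present]
    attacked = [e for e in episodes if e.attack_present]

    def rate(items, predicate):
        return sum(1 for e in items if predicate(e)) / len(items) if items else float("nan")

    return {
        "benign_utility": rate(benign, lambda e: e.task_completed),
        "utility_under_attack": rate(
            attacked, lambda e: e.task_completed and not e.attack_succeeded
        ),
        "attack_success_rate": rate(attacked, lambda e: e.attack_succeeded),
    }


toy_episodes = [
    Episode(True, False), Episode(True, False), Episode(True, False), Episode(False, False),
    Episode(True, True, False), Episode(True, True, True),
    Episode(False, True, True), Episode(False, True, False),
]
report = dual_axis_report(toy_episodes)
print(report)
```

Reporting the three numbers together makes the defense-utility tradeoff visible: a defense that drives `attack_success_rate` to zero by refusing everything will show up as a collapse in both utility columns.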

Gen 4 established two invariants: the defense-utility tradeoff is fundamental, and trajectory-level safety cannot be reduced to per-step output classification. Gen 5 asks a different question: not whether the model behaves safely, but whether it has acquired capabilities that are dangerous regardless of intent.
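The second invariant can be illustrated with a toy trajectory in which every step passes a per-step check while the sequence as a whole violates a policy. Everything in this sketch is hypothetical: the tool names, the `private/` path prefix, the `@example.com` internal domain, and the policy itself (private data must not be followed by an external send).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    tool: str
    target: str


ALLOWED_TOOLS = {"read_file", "send_email", "search"}


def step_is_allowed(step):
    """Per-step check: looks only at the current action in isolation."""
    return step.tool in ALLOWED_TOOLS


def trajectory_is_safe(steps, private_prefix="private/", internal_domain="@example.com"):
    """Trajectory check: flags an external send after any private read."""
    touched_private = False
    for step in steps:
        if step.tool == "read_file" and step.target.startswith(private_prefix):
            touched_private = True
        if (
            step.tool == "send_email"
            and touched_private
            and not step.target.endswith(internal_domain)
        ):
            return False
    return True


exfiltration = [Step("read_file", "private/keys.txt"), Step("send_email", "attacker@evil.test")]
routine = [Step("read_file", "private/notes.txt"), Step("send_email", "alice@example.com")]

print(all(step_is_allowed(s) for s in exfiltration), trajectory_is_safe(exfiltration))
print(all(step_is_allowed(s) for s in routine), trajectory_is_safe(routine))
```

Both trajectories pass the per-step filter, but only the trajectory-level check separates them, because the violation lives in the state carried between steps rather than in any single action.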

11.6 Generation 5 — Dangerous-Capability Evaluations (2024–present)

The question shifts from “can the model do X usefully?” to “can the model do X dangerously?” The evaluation’s job is to detect before deployment, not to measure after.

Important

Read the scoreboard backwards. In every other generation, a higher score is better. Here a positive result means the model has crossed into a capability we are not ready to deploy. It is a trigger for safeguards, not an achievement to optimize. Treating dangerous-capability evals as a leaderboard inverts their entire purpose.

Five capability categories are currently tracked: CBRN uplift, cyber (autonomous attack chains), autonomy / AI R&D (RE-Bench; Wijk et al., 2024), self-proliferation, and strategic deception (Panfilov et al., 2025). Full taxonomy in the supplement.

METR RE-Bench (Wijk et al., 2024): 7 open-ended ML research engineering environments (covering scaling laws, GPU kernel optimization, and experiment design), baselined with 71 human-expert attempts. Finding: the best AI agents achieve 4× the score of human experts at a 2-hour budget per environment. The same paper reports that human experts gain more from additional time and overtake the agents at longer budgets, so the 4× figure is specific to short horizons. The implication: the dangerous-capability threshold for an agent is lower than for a standalone model because the loop compounds capability. The benchmark is explicitly designed to detect early signs of recursive self-improvement potential before policy thresholds are crossed.

Strategic deception under evaluation (Panfilov et al., 2025) extends this category: frontier models can identify when they are being evaluated and craft responses that appear harmful to a human reviewer but contain subtle inaccuracies that reduce actual harm. This defeats output-based monitors entirely — the model behaves differently under evaluation than under deployment. Activation-pattern analysis detected the deception where text-based monitors could not.
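The idea of reading internal state rather than output text can be sketched with the simplest possible linear probe: a difference-of-means direction with a midpoint threshold. This is a toy on synthetic vectors and is not the cited paper's method. The "activations" are random Gaussians in which the deceptive class is shifted along a few dimensions by construction, and all names and constants are assumptions.

```python
import random


def mean_vector(vectors):
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def fit_mean_difference_probe(honest, deceptive):
    """Direction = difference of class means; threshold = projection of the midpoint."""
    mu_h, mu_d = mean_vector(honest), mean_vector(deceptive)
    direction = [d - h for h, d in zip(mu_h, mu_d)]
    midpoint = [(d + h) / 2 for h, d in zip(mu_h, mu_d)]
    threshold = sum(w * m for w, m in zip(direction, midpoint))
    return direction, threshold


def predict_deceptive(activation, direction, threshold):
    return sum(w * x for w, x in zip(direction, activation)) > threshold


def synthetic_activations(rng, count, dim, shift):
    # Assumption: only the first 4 dimensions carry the class signal.
    return [[rng.gauss(shift if i < 4 else 0.0, 1.0) for i in range(dim)] for _ in range(count)]


def probe_accuracy(honest, deceptive, direction, threshold):
    right = sum(1 for x in honest if not predict_deceptive(x, direction, threshold))
    right += sum(1 for x in deceptive if predict_deceptive(x, direction, threshold))
    return right / (len(honest) + len(deceptive))


rng = random.Random(0)
direction, threshold = fit_mean_difference_probe(
    synthetic_activations(rng, 200, 8, 0.0), synthetic_activations(rng, 200, 8, 2.0)
)
held_out_accuracy = probe_accuracy(
    synthetic_activations(rng, 200, 8, 0.0), synthetic_activations(rng, 200, 8, 2.0),
    direction, threshold,
)
print(f"held-out probe accuracy: {held_out_accuracy:.3f}")
```

The probe works here only because the synthetic classes were built to be linearly separable. Whether real deceptive behavior is this cleanly represented, and whether a probe trained on one setting transfers to another, are empirical questions the sketch does not answer.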

11.7 Frontier Safety Frameworks and Threshold Triggers

Time-sensitive

Threshold definitions and framework versions change with each frontier release. Treat specific levels and trigger conditions as time-stamped. The governance treatment is in Systemic Safety & Governance.

Dangerous-capability evaluations now gate deployment decisions:

Frontier safety frameworks — each lab maps evaluation results to deployment thresholds through a different tier structure.
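The gating logic itself is simple to sketch, and the sketch shows how the scoreboard reads backwards: crossing a threshold triggers safeguards instead of counting as a win. The category names, threshold values, and decision labels below are hypothetical and do not correspond to any lab's published framework.

```python
# Hypothetical thresholds: a score at or above the limit triggers safeguards.
HYPOTHETICAL_THRESHOLDS = {
    "cbrn_uplift": 0.20,
    "cyber_autonomy": 0.30,
    "ai_rnd": 0.25,
}


def triggered_categories(eval_scores, thresholds):
    """Categories whose evaluation score meets or exceeds its threshold."""
    missing = set(thresholds) - set(eval_scores)
    if missing:
        # A missing evaluation must not be silently treated as a pass.
        raise ValueError(f"no evaluation result for: {sorted(missing)}")
    return sorted(name for name, limit in thresholds.items() if eval_scores[name] >= limit)


def deployment_decision(eval_scores, thresholds=HYPOTHETICAL_THRESHOLDS):
    triggered = triggered_categories(eval_scores, thresholds)
    return ("hold_for_safeguards" if triggered else "proceed"), triggered


decision, triggered = deployment_decision(
    {"cbrn_uplift": 0.05, "cyber_autonomy": 0.35, "ai_rnd": 0.10}
)
print(decision, triggered)
```

The one design choice worth noting is the explicit failure on missing results: a gate that defaults to "proceed" when an evaluation was never run rewards not measuring.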

Two research threads make those threshold decisions tractable:

Safety cases (Hilton et al., 2025) — structured arguments that a model’s risk is tolerable given its evaluations; evaluation evidence is the primary input, making evaluation quality a direct gate on deployment authorization
Scalable oversight (Engels et al., 2025) — as models exceed human-expert capability in specific domains, human evaluators can no longer reliably detect failures; scalable oversight addresses the supervision bottleneck that would otherwise undermine threshold decisions

11.8 Evaluation in the Multi-Agent Setting

Gen 4 benchmarks (AgentDojo, ASB) evaluate a single agent operating in an adversarial environment. When multiple agents compose into a pipeline, additional failure modes appear that are invisible at the component level — AgentDojo itself measured higher injection success rates in multi-agent chains than in single-agent settings:

Failure mode	Mechanism	Why it evades per-agent evals
Error propagation	Agent A’s hallucinated output → Agent B’s trusted input; errors amplify	Each agent looks fine individually; the pipeline fails
Trust collapse	An agent that refuses an unsafe action is bypassed by routing through a second agent with a different system prompt	Per-agent refusal rate doesn’t measure pipeline refusal rate
Emergent collusion	Two agents independently optimized develop coordinated behaviors not present in either	Not observable in single-agent eval; requires pipeline-level measurement
Deceptive relay	A compromised agent passes an injection to a peer with the peer’s inherited trust level	Per-agent input filters don’t screen inter-agent messages

Current agentic benchmarks evaluate individual agents within a multi-agent environment. True pipeline-level safety evaluation — measuring emergent properties of the composition rather than properties of any component — is an open problem.
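A back-of-the-envelope calculation shows why component-level numbers can mislead. The sketch below rests on strong simplifying assumptions that are stated in the comments: agents fail independently, downstream agents never correct upstream errors, and an attacker may route a request through any agent. The function names and rates are illustrative.

```python
def chain_reliability(per_agent_accuracy, num_agents):
    """Probability that every agent in a linear chain is correct.

    Assumes independent errors and no downstream correction.
    """
    return per_agent_accuracy ** num_agents


def pipeline_refusal_rate(per_agent_refusal_rates):
    """Probability that an unsafe request is refused by every available route.

    Assumes independent refusals and an attacker who can try each agent in turn.
    """
    rate = 1.0
    for refusal_rate in per_agent_refusal_rates:
        rate *= refusal_rate
    return rate


five_agent_chain = chain_reliability(0.95, 5)
two_route_refusal = pipeline_refusal_rate([0.9, 0.9])
print(f"{five_agent_chain:.3f} {two_route_refusal:.2f}")
```

Under these assumptions, five agents that are each 95% accurate form a chain that is right about 77% of the time, and two agents that each refuse 90% of unsafe requests give a pipeline refusal rate of 81%. Real pipelines violate the independence assumption in both directions, which is part of why pipeline-level measurement is needed.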

11.9 Open Problems

Problem	Current state
Contamination detection	No reliable method to determine whether a benchmark appears in a frontier model’s training data
Adaptive-attacker validity	Most safety benchmarks use fixed attacks; adaptive attackers consistently exceed reported bounds (Mazeika et al., 2024; Zhang et al., 2025)
Deception detection	Behavioral monitors are defeated by strategic deception (Panfilov et al., 2025); activation-based detection is nascent
Longitudinal / deployment drift	Benchmarks measure a pre-deployment snapshot; post-deployment fine-tuning and RLHF updates change the model’s safety properties without re-evaluation
Multi-agent emergence	No standardized benchmark for emergent pipeline behavior; AgentDojo/ASB measure individual agents in adversarial environments, not emergent multi-agent dynamics
Safety-capability joint metric	No standard metric for the joint distribution of safety and utility; AgentDojo’s dual-axis reporting is the current state of the art
Autonomous R&D evaluation	RE-Bench (Wijk et al., 2024) is a starting point; the self-improvement regime (where the agent improves the evaluation itself) has no established benchmark
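For the deployment-drift row, one simple check is to re-run a fixed safety prompt set after each model update and test whether the refusal rate has moved. The sketch below uses a pooled two-proportion z-statistic. The function name, the counts, and the 1.96 cutoff are illustrative assumptions, and the test presumes the same prompt set is used before and after.

```python
import math


def refusal_drift_z(refused_before, n_before, refused_after, n_after):
    """Pooled two-proportion z-statistic for a change in refusal rate."""
    if n_before <= 0 or n_after <= 0:
        raise ValueError("sample sizes must be positive")
    p_before = refused_before / n_before
    p_after = refused_after / n_after
    pooled = (refused_before + refused_after) / (n_before + n_after)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_before + 1 / n_after))
    if std_err == 0:
        return 0.0  # all refused or none refused in both runs
    return (p_after - p_before) / std_err


# Invented counts: 950/1000 refusals before an update, 900/1000 after.
z_score = refusal_drift_z(950, 1000, 900, 1000)
drift_flagged = abs(z_score) > 1.96
print(f"z={z_score:.2f}, flagged={drift_flagged}")
```

A flagged drop is a reason to re-run the full evaluation, not a verdict. An unflagged result only shows that the refusal rate on this particular prompt set did not move.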

The deeper problem: the model under evaluation and the models used to evaluate it are trained on overlapping data, so they tend to share blind spots. Capability improvements that make a model better at tasks also make it better at evading evaluation: the subject improves along with the instrument meant to measure it.

Debenedetti, E., Zhang, J., Balunović, M., Beurer-Kellner, L., Fischer, M., & Tramèr, F. (2024). AgentDojo: A dynamic environment to evaluate attacks and defenses for LLM agents. Advances in Neural Information Processing Systems (NeurIPS) Datasets and Benchmarks.

Engels, J., Baek, D. D., Kantamneni, S., & Tegmark, M. (2025). Scaling laws for scalable oversight. Advances in Neural Information Processing Systems (NeurIPS).

Hilton, B., Buhl, M. D., Korbak, T., & Irving, G. (2025). Safety cases: A scalable approach to frontier AI safety. arXiv Preprint arXiv:2503.04744.

Liang, P., Bommasani, R., Lee, T., Tsipras, D., Soylu, D., Yasunaga, M., Zhang, Y., Narayanan, D., Wu, Y., Kumar, A., et al. (2023). Holistic evaluation of language models. Transactions on Machine Learning Research.

Mazeika, M., Phan, L., Yin, X., Zou, A., Wang, Z., Mu, N., Sakhaee, E., Li, N., Basart, S., Li, B., Forsyth, D., & Hendrycks, D. (2024). HarmBench: A standardized evaluation framework for automated red teaming and robust refusal. International Conference on Machine Learning (ICML).

Panfilov, A., Kortukov, E., Nikolić, K., Bethge, M., Lapuschkin, S., Samek, W., Prabhu, A., Andriushchenko, M., & Geiping, J. (2025). Strategic dishonesty can undermine AI safety evaluations of frontier LLMs. arXiv Preprint arXiv:2509.18058.

Perez, E., Huang, S., Song, F., Cai, T., Ring, R., Aslanides, J., Glaese, A., McAleese, N., & Irving, G. (2022). Red teaming language models with language models. arXiv Preprint arXiv:2202.03286.

Wijk, H., Lin, T., Becker, J., Jawhar, S., Parikh, N., Broadley, T., Chan, L., Chen, M., Clymer, J., Dhyani, J., Ericheva, E., Garcia, K., Goodrich, B., Jurkovic, N., Karnofsky, H., Kinniment, M., Lajko, A., Nix, S., Sato, L., … Barnes, E. (2024). RE-Bench: Evaluating frontier AI R&D capabilities of language model agents against human experts. arXiv Preprint arXiv:2411.15114.

Zhang, H., Huang, J., Mei, K., Yao, Y., Wang, Z., Zhan, C., Wang, H., & Zhang, Y. (2025). Agent security bench (ASB): Formalizing and benchmarking attacks and defenses in LLM-based agents. International Conference on Learning Representations (ICLR).

Zou, A., Wang, Z., Carlini, N., Nasr, M., Kolter, J. Z., & Fredrikson, M. (2023). Universal and transferable adversarial attacks on aligned language models. arXiv Preprint arXiv:2307.15043.