17 References

Amodei, D., Olah, C., Steinhardt, J., Christiano, P., Schulman, J., & Mané, D. (2016). Concrete problems in AI safety. arXiv Preprint arXiv:1606.06565.

Bai, Y., Kadavath, S., Kundu, S., Askell, A., Kernion, J., Jones, A., et al. (2022). Constitutional AI: Harmlessness from AI feedback. arXiv Preprint arXiv:2212.08073.

Bengio, Y., et al. (2025). International AI safety report. arXiv Preprint arXiv:2501.17805.

Bhagwatkar, R., Kasa, K., Puri, A., Huang, G., Rish, I., Taylor, G. W., Dvijotham, K. D., & Lacoste, A. (2025). Indirect prompt injections: Are firewalls all you need, or stronger benchmarks? arXiv Preprint arXiv:2510.05244.

Bostrom, N. (2014). Superintelligence: Paths, dangers, strategies. Oxford University Press.

Bowman, S. R., Hyun, J., Perez, E., Chen, E., Pettit, C., Heiner, S., et al. (2022). Measuring progress on scalable oversight for large language models. arXiv Preprint arXiv:2211.03540.

Burns, C., Izmailov, P., Kirchner, J. H., Baker, B., Gao, L., Aschenbrenner, L., Chen, Y., Ecoffet, A., Joglekar, M., Leike, J., Sutskever, I., & Wu, J. (2023). Weak-to-strong generalization: Eliciting strong capabilities with weak supervision. arXiv Preprint arXiv:2312.09390.

Carlini, N., Tramèr, F., Wallace, E., Jagielski, M., Herbert-Voss, A., Lee, K., Roberts, A., Brown, T. B., Song, D., Erlingsson, Ú., Oprea, A., & Raffel, C. (2021). Extracting training data from large language models. Proceedings of the 30th USENIX Security Symposium, 2633–2650.

Chang, H., Bao, E., Luo, X., & Yu, T. (2026). Overcoming the retrieval barrier: Indirect prompt injection in the wild for LLM systems. arXiv Preprint arXiv:2601.07072.

Christiano, P. F., Leike, J., Brown, T. B., Martic, M., Legg, S., & Amodei, D. (2017). Deep reinforcement learning from human preferences. Advances in Neural Information Processing Systems (NeurIPS).

Debenedetti, E., Zhang, J., Balunović, M., Beurer-Kellner, L., Fischer, M., & Tramèr, F. (2024). AgentDojo: A dynamic environment to evaluate attacks and defenses for LLM agents. Advances in Neural Information Processing Systems (NeurIPS) Datasets and Benchmarks.

Dobbe, R. (2022). System safety and artificial intelligence. arXiv Preprint arXiv:2202.09292.

Elhage, N., Nanda, N., Olsson, C., Henighan, T., Joseph, N., Mann, B., et al. (2021). A mathematical framework for transformer circuits. Transformer Circuits Thread (Anthropic).

Engels, J., Baek, D. D., Kantamneni, S., & Tegmark, M. (2025). Scaling laws for scalable oversight. Advances in Neural Information Processing Systems (NeurIPS).

European Parliament and Council of the European Union. (2024). Regulation (EU) 2024/1689 laying down harmonised rules on artificial intelligence (Artificial Intelligence Act). Official Journal of the European Union.

Good, I. J. (1965). Speculations concerning the first ultraintelligent machine. In Advances in computers (Vol. 6, pp. 31–88). Elsevier.

Google. (2023). Secure AI framework (SAIF). https://saif.google/.

Greenblatt, R., Shlegeris, B., Sachan, K., & Roger, F. (2023). AI control: Improving safety despite intentional subversion. arXiv Preprint arXiv:2312.06942.

Greshake, K., Abdelnabi, S., Mishra, S., Endres, C., Holz, T., & Fritz, M. (2023). Not what you’ve signed up for: Compromising real-world LLM-integrated applications with indirect prompt injection. arXiv Preprint arXiv:2302.12173.

Hendrycks, D., Mazeika, M., & Woodside, T. (2023). An overview of catastrophic AI risks. arXiv Preprint arXiv:2306.12001.

Hilton, B., Buhl, M. D., Korbak, T., & Irving, G. (2025). Safety cases: A scalable approach to frontier AI safety. arXiv Preprint arXiv:2503.04744.

Hubinger, E., Denison, C., Mu, J., Lambert, M., Tong, M., MacDiarmid, M., et al. (2024). Sleeper agents: Training deceptive LLMs that persist through safety training. arXiv Preprint arXiv:2401.05566.

Inan, H., Upasani, K., Chi, J., Rungta, R., Iyer, K., Mao, Y., Tontchev, M., Hu, Q., Fuller, B., Testuggine, D., & Khabsa, M. (2023). Llama Guard: LLM-based input-output safeguard for human-AI conversations. arXiv Preprint arXiv:2312.06674.

International Organization for Standardization. (2023). ISO/IEC 42001:2023 — information technology — artificial intelligence — management system. ISO/IEC.

Kim, J., Liu, X., Wang, Z., Qiu, S., Li, B., Guo, W., & Song, D. (2026). The attack and defense landscape of agentic AI: A comprehensive survey. USENIX Security Symposium.

Leveson, N. G. (1995). Safeware: System safety and computers. Addison-Wesley.

Li, Y., Huang, H., Zhao, Y., Ma, X., & Sun, J. (2025). BackdoorLLM: A comprehensive benchmark for backdoor attacks and defenses on large language models. Advances in Neural Information Processing Systems (NeurIPS) Datasets and Benchmarks.

Liang, P., Bommasani, R., Lee, T., Tsipras, D., Soylu, D., Yasunaga, M., Zhang, Y., Narayanan, D., Wu, Y., Kumar, A., et al. (2023). Holistic evaluation of language models. Transactions on Machine Learning Research.

Lieberum, T., Rajamanoharan, S., Conmy, A., Smith, L., Sonnerat, N., Varma, V., Kramár, J., Dragan, A., Shah, R., & Nanda, N. (2024). Gemma Scope: Open sparse autoencoders everywhere all at once on Gemma 2. arXiv Preprint arXiv:2408.05147.

Lindsey, J., Batson, J., Denison, C., et al. (2025). Circuit tracing: Revealing computational graphs in language models. Transformer Circuits Thread (Anthropic).

Mazeika, M., Phan, L., Yin, X., Zou, A., Wang, Z., Mu, N., Sakhaee, E., Li, N., Basart, S., Li, B., Forsyth, D., & Hendrycks, D. (2024). HarmBench: A standardized evaluation framework for automated red teaming and robust refusal. International Conference on Machine Learning (ICML).

MITRE. (n.d.). ATLAS: Adversarial threat landscape for artificial-intelligence systems. https://atlas.mitre.org/.

National Institute of Standards and Technology. (2023). Artificial intelligence risk management framework (AI RMF 1.0) (NIST AI 100-1). NIST. https://doi.org/10.6028/NIST.AI.100-1

Ng, A. Y., & Russell, S. (2000). Algorithms for inverse reinforcement learning. Proceedings of the 17th International Conference on Machine Learning (ICML), 663–670.

Olah, C., Cammarata, N., Schubert, L., Goh, G., Petrov, M., & Carter, S. (2020). Zoom in: An introduction to circuits. Distill. https://doi.org/10.23915/distill.00024.001

Omohundro, S. M. (2008). The basic AI drives. Artificial General Intelligence (AGI).

Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C. L., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., et al. (2022). Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems (NeurIPS).

OWASP Foundation. (2025). OWASP Top 10 for large language model applications. https://owasp.org/www-project-top-10-for-large-language-model-applications/.

Panfilov, A., Kortukov, E., Nikolić, K., Bethge, M., Lapuschkin, S., Samek, W., Prabhu, A., Andriushchenko, M., & Geiping, J. (2025). Strategic dishonesty can undermine AI safety evaluations of frontier LLMs. arXiv Preprint arXiv:2509.18058.

Perez, E., Huang, S., Song, F., Cai, T., Ring, R., Aslanides, J., Glaese, A., McAleese, N., & Irving, G. (2022). Red teaming language models with language models. arXiv Preprint arXiv:2202.03286.

Phelps, S., & Russell, Y. I. (2023). Emergent cooperation and strategy adaptation in multi-agent systems: An extended coevolutionary theory with LLMs. Proceedings of the 2023 AAAI Symposium on AI and Narrative Intelligence.

Rafailov, R., Sharma, A., Mitchell, E., Ermon, S., Manning, C. D., & Finn, C. (2023). Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems (NeurIPS).

Sharma, M., Tong, M., Mu, J., Wei, J., Kruthoff, J., Goodfriend, S., Ong, E., Peng, A., Agarwal, R., Anil, C., Askell, A., Bailey, N., Benton, J., Bluemke, E., Bowman, S. R., Christiansen, E., Cunningham, H., Dau, A., Gopal, A., … Perez, E. (2025). Constitutional classifiers: Defending against universal jailbreaks across thousands of hours of red teaming. arXiv Preprint arXiv:2501.18837.

Templeton, A., Conerly, T., Marcus, J., Lindsey, J., Bricken, T., et al. (2024). Scaling monosemanticity: Extracting interpretable features from Claude 3 Sonnet. Transformer Circuits Thread (Anthropic).

Wiener, N. (1960). Some moral and technical consequences of automation. Science, 131(3410), 1355–1358.

Wijk, H., Lin, T., Becker, J., Jawhar, S., Parikh, N., Broadley, T., Chan, L., Chen, M., Clymer, J., Dhyani, J., Ericheva, E., Garcia, K., Goodrich, B., Jurkovic, N., Karnofsky, H., Kinniment, M., Lajko, A., Nix, S., Sato, L., … Barnes, E. (2024). RE-Bench: Evaluating frontier AI R&D capabilities of language model agents against human experts. arXiv Preprint arXiv:2411.15114.

Zhang, H., Huang, J., Mei, K., Yao, Y., Wang, Z., Zhan, C., Wang, H., & Zhang, Y. (2025). Agent security bench (ASB): Formalizing and benchmarking attacks and defenses in LLM-based agents. International Conference on Learning Representations (ICLR).

Zou, A., Phan, L., Wang, J., Duenas, D., Lin, M., Andriushchenko, M., Wang, R., Kolter, J. Z., Fredrikson, M., & Hendrycks, D. (2024). Improving alignment and robustness with circuit breakers. arXiv Preprint arXiv:2406.04313.

Zou, A., Wang, Z., Carlini, N., Nasr, M., Kolter, J. Z., & Fredrikson, M. (2023). Universal and transferable adversarial attacks on aligned language models. arXiv Preprint arXiv:2307.15043.