16 References
Amodei, D., Olah, C., Steinhardt, J., Christiano, P., Schulman, J.,
& Mané, D. (2016). Concrete problems in AI safety.
arXiv Preprint arXiv:1606.06565.
Bengio, Y. et al. (2025). International
AI safety report. arXiv Preprint arXiv:2501.17805.
Bostrom, N. (2014). Superintelligence: Paths, dangers,
strategies. Oxford University Press.
Bowman, S. R., Hyun, J., Perez, E., Chen, E.,
Pettit, C., Heiner, S., et al. (2022). Measuring progress on
scalable oversight for large language models. arXiv Preprint
arXiv:2211.03540.
Burns, C., Izmailov, P., Kirchner, J. H., Baker, B., Gao, L.,
Aschenbrenner, L., Chen, Y., Ecoffet, A., Joglekar, M., Leike, J.,
Sutskever, I., & Wu, J. (2023). Weak-to-strong generalization:
Eliciting strong capabilities with weak supervision. arXiv Preprint
arXiv:2312.09390.
Chang, H., Bao, E., Luo, X., & Yu, T. (2026). Overcoming the
retrieval barrier: Indirect prompt injection in the wild for
LLM systems. arXiv Preprint arXiv:2601.07072.
Debenedetti, E., Zhang, J., Balunović, M., Beurer-Kellner, L., Fischer,
M., & Tramèr, F. (2024). AgentDojo: A dynamic environment to
evaluate attacks and defenses for LLM agents. Advances
in Neural Information Processing Systems (NeurIPS) Datasets and
Benchmarks.
Dobbe, R. (2022). System safety and artificial intelligence. arXiv
Preprint arXiv:2202.09292.
European Parliament and Council of the European Union. (2024).
Regulation (EU) 2024/1689 laying down harmonised rules on artificial
intelligence (Artificial Intelligence Act). Official Journal of the
European Union.
Good, I. J. (1965). Speculations concerning the first ultraintelligent
machine. In Advances in computers (Vol. 6, pp. 31–88).
Elsevier.
Greenblatt, R., Shlegeris, B., Sachan, K., & Roger, F. (2023).
AI control: Improving safety despite intentional
subversion. arXiv Preprint arXiv:2312.06942.
Greshake, K., Abdelnabi, S., Mishra, S., Endres, C., Holz, T., &
Fritz, M. (2023). Not what you’ve signed up for: Compromising real-world
LLM-integrated applications with indirect prompt injection.
arXiv Preprint arXiv:2302.12173.
Hendrycks, D., Mazeika, M., & Woodside, T. (2023). An overview of
catastrophic AI risks. arXiv Preprint
arXiv:2306.12001.
Hilton, B., Buhl, M. D., Korbak, T., & Irving, G. (2025). Safety
cases: A scalable approach to frontier AI safety. arXiv
Preprint arXiv:2503.04744.
Huang et al. (2025). Indirect prompt
injections: Are firewalls all you need, or stronger benchmarks?
Advances in Neural Information Processing Systems (NeurIPS).
Hubinger, E., Denison, C., Mu, J., Lambert, M.,
Tong, M., MacDiarmid, M., et al. (2024). Sleeper agents: Training
deceptive LLMs that persist through safety training.
arXiv Preprint arXiv:2401.05566.
Inan, H., Upasani, K., Chi, J., Rungta, R., Iyer, K., Mao, Y., Tontchev,
M., Hu, Q., Fuller, B., Testuggine, D., & Khabsa, M. (2023). Llama
Guard: LLM-based input-output safeguard for human-AI conversations.
arXiv Preprint arXiv:2312.06674.
International Organization for Standardization. (2023). ISO/IEC
42001:2023 — information technology — artificial intelligence —
management system. ISO/IEC.
Kim, J., Liu, X., Wang, Z., Qiu, S., Li, B., Guo, W., & Song, D.
(2026). The attack and defense landscape of agentic AI: A
comprehensive survey. USENIX Security Symposium.
Leveson, N. G. (1995). Safeware: System safety and computers.
Addison-Wesley.
Lieberum, T., Rajamanoharan, S., Conmy, A., Smith, L., Sonnerat, N.,
Varma, V., Kramár, J., Dragan, A., Shah, R., & Nanda, N. (2024).
Gemma Scope: Open sparse autoencoders everywhere all at once on Gemma 2.
arXiv Preprint arXiv:2408.05147.
MITRE. (n.d.). ATLAS: Adversarial threat landscape for
artificial-intelligence systems. https://atlas.mitre.org/.
National Institute of Standards and Technology. (2023). Artificial
intelligence risk management framework (AI RMF
1.0) (NIST AI 100-1). NIST. https://doi.org/10.6028/NIST.AI.100-1
Olah, C., Cammarata, N., Schubert, L., Goh, G., Petrov, M., &
Carter, S. (2020). Zoom in: An introduction to circuits.
Distill. https://doi.org/10.23915/distill.00024.001
Omohundro, S. M. (2008). The basic AI drives.
Artificial General Intelligence (AGI).
Ouyang, L., Wu, J., Jiang, X., Almeida, D.,
Wainwright, C. L., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray,
A., et al. (2022). Training language models to follow
instructions with human feedback. Advances in Neural Information
Processing Systems (NeurIPS).
OWASP Foundation. (2025). OWASP top 10 for large language model
applications. https://owasp.org/www-project-top-10-for-large-language-model-applications/.
Rafailov, R., Sharma, A., Mitchell, E., Ermon, S., Manning, C. D., &
Finn, C. (2023). Direct preference optimization: Your language model is
secretly a reward model. Advances in Neural Information Processing
Systems (NeurIPS).
Sharma, M. et al. (2025). Constitutional
classifiers: Defending against universal jailbreaks across thousands of
hours of red teaming. arXiv Preprint arXiv:2501.18837.
Templeton, A., Conerly, T., Marcus, J., Lindsey,
J., Bricken, T., et al. (2024). Scaling monosemanticity:
Extracting interpretable features from Claude 3 Sonnet. Transformer
Circuits Thread (Anthropic).
Wiener, N. (1960). Some moral and technical consequences of automation.
Science, 131(3410), 1355–1358.
Zhang, H. et al. (2024). Agent security
bench (ASB): Formalizing and benchmarking attacks and
defenses in LLM-based agents. arXiv Preprint
arXiv:2410.02644.
Zou, A., Phan, L., Wang, J., Duenas, D., Lin, M., Andriushchenko, M.,
Wang, R., Kolter, Z., Fredrikson, M., & Hendrycks, D. (2024).
Improving alignment and robustness with circuit breakers. arXiv
Preprint arXiv:2406.04313.
Zou, A., Wang, Z., Carlini, N., Nasr, M., Kolter, J. Z., &
Fredrikson, M. (2023). Universal and transferable adversarial attacks on
aligned language models. arXiv Preprint arXiv:2307.15043.