Bengio, Y. et al. (2025). International AI safety report. arXiv Preprint arXiv:2501.17805.
Bowman, S. R., Hyun, J., Perez, E., Chen, E., Pettit, C., Heiner, S., et al. (2022). Measuring progress on scalable oversight for large language models. arXiv Preprint arXiv:2211.03540.
Burns, C., Izmailov, P., Kirchner, J. H., Baker, B., Gao, L., Aschenbrenner, L., Chen, Y., Ecoffet, A., Joglekar, M., Leike, J., Sutskever, I., & Wu, J. (2023). Weak-to-strong generalization: Eliciting strong capabilities with weak supervision. arXiv Preprint arXiv:2312.09390.
Chang, H., Bao, E., Luo, X., & Yu, T. (2026). Overcoming the retrieval barrier: Indirect prompt injection in the wild for LLM systems. arXiv Preprint arXiv:2601.07072.
Debenedetti, E., Zhang, J., Balunović, M., Beurer-Kellner, L., Fischer, M., & Tramèr, F. (2024). AgentDojo: A dynamic environment to evaluate attacks and defenses for LLM agents. Advances in Neural Information Processing Systems (NeurIPS) Datasets and Benchmarks.
European Parliament and Council of the European Union. (2024). Regulation (EU) 2024/1689 laying down harmonised rules on artificial intelligence (artificial intelligence act). Official Journal of the European Union.
Google. (2023). Secure AI framework (SAIF). https://saif.google/
Greshake, K., Abdelnabi, S., Mishra, S., Endres, C., Holz, T., & Fritz, M. (2023). Not what you’ve signed up for: Compromising real-world LLM-integrated applications with indirect prompt injection. arXiv Preprint arXiv:2302.12173.
Hendrycks, D., Mazeika, M., & Woodside, T. (2023). An overview of catastrophic AI risks. arXiv Preprint arXiv:2306.12001.
Huang et al. (2025). Indirect prompt injections: Are firewalls all you need, or stronger benchmarks? Advances in Neural Information Processing Systems (NeurIPS).
Inan, H., Upasani, K., Chi, J., Rungta, R., Iyer, K., Mao, Y., Tontchev, M., Hu, Q., Fuller, B., Testuggine, D., & Khabsa, M. (2023). Llama Guard: LLM-based input-output safeguard for human-AI conversations. arXiv Preprint arXiv:2312.06674.
International Organization for Standardization. (2023). ISO/IEC 42001:2023 — information technology — artificial intelligence — management system. ISO/IEC.
Kim, J., Liu, X., Wang, Z., Qiu, S., Li, B., Guo, W., & Song, D. (2026). The attack and defense landscape of agentic AI: A comprehensive survey. USENIX Security Symposium.
Lieberum, T., Rajamanoharan, S., Conmy, A., Smith, L., Sonnerat, N., Varma, V., Kramár, J., Dragan, A., Shah, R., & Nanda, N. (2024). Gemma Scope: Open sparse autoencoders everywhere all at once on Gemma 2. arXiv Preprint arXiv:2408.05147.
MITRE. (n.d.). ATLAS: Adversarial threat landscape for artificial-intelligence systems. https://atlas.mitre.org/
National Institute of Standards and Technology. (2023). Artificial intelligence risk management framework (AI RMF 1.0) (NIST AI 100-1). NIST. https://doi.org/10.6028/NIST.AI.100-1
Olah, C., Cammarata, N., Schubert, L., Goh, G., Petrov, M., & Carter, S. (2020). Zoom in: An introduction to circuits. Distill. https://doi.org/10.23915/distill.00024.001
Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C. L., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., et al. (2022). Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems (NeurIPS).
OWASP Foundation. (2025). OWASP top 10 for large language model applications. https://owasp.org/www-project-top-10-for-large-language-model-applications/
Rafailov, R., Sharma, A., Mitchell, E., Ermon, S., Manning, C. D., & Finn, C. (2023). Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems (NeurIPS).
Sharma, M. et al. (2025). Constitutional classifiers: Defending against universal jailbreaks across thousands of hours of red teaming. arXiv Preprint arXiv:2501.18837.
Templeton, A., Conerly, T., Marcus, J., Lindsey, J., Bricken, T., et al. (2024). Scaling monosemanticity: Extracting interpretable features from Claude 3 Sonnet. Transformer Circuits Thread (Anthropic).
Zhang, H. et al. (2024). Agent security bench (ASB): Formalizing and benchmarking attacks and defenses in LLM-based agents. arXiv Preprint arXiv:2410.02644.
Zou, A., Phan, L., Wang, J., Duenas, D., Lin, M., Andriushchenko, M., Wang, R., Kolter, Z., Fredrikson, M., & Hendrycks, D. (2024). Improving alignment and robustness with circuit breakers. arXiv Preprint arXiv:2406.04313.
Zou, A., Wang, Z., Carlini, N., Nasr, M., Kolter, J. Z., & Fredrikson, M. (2023). Universal and transferable adversarial attacks on aligned language models. arXiv Preprint arXiv:2307.15043.