May 12, 2025

Ensuring the Future of Safe AI: Perspectives from IVADO and Simons’ Safety-Guaranteed LLMs Workshop in Berkeley

Amid rapid advances in artificial intelligence, IVADO and the Simons Institute for the Theory of Computing at UC Berkeley joined forces for the forward-thinking “Safety-Guaranteed LLMs” workshop. The collaborative event explored groundbreaking approaches to ensuring that increasingly capable large language models remain safe, ethical, and aligned with human values. Drawing on theoretical foundations, including Bayesian and reinforcement learning perspectives, the workshop connected current challenges with longer-term solutions.

This event, the third major workshop of our Thematic Semester on Large Language Models and Transformers, brought together leading researchers working at the intersection of artificial intelligence, safety, and theoretical computer science.

IVADO sponsored a delegation of 32 professors and students from across Canada to participate in this important gathering from April 14th to 18th. Several of our students and speakers generously shared their impressions and reflections with us, which we are pleased to highlight here. Their perspectives offer a glimpse into the key themes, ideas, and emerging directions that shaped the conversations throughout the week.

Key Themes and Impressions from Participants

Beyond Training: Building Guardrails at Every Level

A recurring theme throughout the workshop was the urgent need for safety mechanisms that extend beyond the training phase. Talks such as Boaz Barak’s presentation on “AI Safety via Inference-Time Compute” and Ethan Perez’s session on “Controlling Untrusted AIs with Monitors” inspired discussion on practical strategies for building dynamic, real-time oversight into deployed LLMs to detect and mitigate risks beyond their initial design.

The Urgent Need for Robustness

Participants reflected on how real-world vulnerabilities like jailbreaks and adversarial attacks are no longer speculative. Presentations by Gauthier Gidel on adversarial training and by Siva Reddy on the robustness of models' resistance to jailbreaks highlighted why it is critical to stress-test models systematically against adversarial threats, an area that must move to the forefront of LLM development.

Additional talks by Jacob Steinhardt, Aditi Raghunathan, and Dan Hendrycks brought valuable perspectives on generalization, robustness to distributional shifts, and the limitations of current benchmarks—reinforcing how foundational these challenges are to building resilient AI systems.

Rethinking Alignment Beyond Preference Models

Many discussions challenged traditional alignment methods that rely heavily on simple preference optimization. In particular, sessions featuring Yoshua Bengio, Dawn Song, and Geoffrey Irving emphasized the importance of embedding deeper human norms, values, and societal principles into LLM objectives, encouraging participants to reframe how success in alignment should be defined and evaluated.

Talks such as Ryan Lowe’s on “Full-Stack Alignment” explored how technical alignment strategies can be informed by broader societal objectives — encouraging participants to think holistically about the multi-level challenges of building LLMs that are both useful and value-aligned.

Multi-Agent Dynamics: A Complex New Challenge

As LLMs increasingly operate as agents capable of interacting, negotiating, or conflicting, participants reflected on the urgent need to understand multi-agent safety dynamics. Talks such as Georgios Piliouras’s exploration of efficient debate games and Aaron Courville’s game-theoretic framing of AI systems provided vivid examples of how multi-agent settings introduce unpredictability and emergent failure modes—highlighting the need for safety strategies that go beyond individual model alignment.

Formal Safety Guarantees and New Methodologies

Several speakers showcased how techniques from theoretical computer science and cryptography are being adapted for AI safety. Discussions inspired by researchers working on formal verification and security proofs underlined a growing movement toward developing mathematically grounded guarantees against backdoors, exploits, and other risks in LLMs.

These conversations were enriched by insights from researchers like Shafi Goldwasser and Geoffrey Irving, who offered complementary perspectives on provable safety guarantees, and helped inspire participants to build new bridges between AI safety and cryptography.

Talks by Roger Grosse and Dhanya Sridhar also emphasized the importance of interpretability, with Dhanya exploring the role of causality and Roger highlighting how interpretability connects to safety in model behavior and design.

Governing Frontier AI Development

Participants were also reminded that technical progress must be complemented by governance innovation. Talks throughout the week, particularly those touching on public-good frameworks championed by Yoshua Bengio and others, stressed the need for incentives, oversight mechanisms, and international cooperation to ensure that frontier AI development remains aligned with societal values and interests.

The broader public’s engagement with AI risks also surfaced during the event. At one point, a talk was interrupted by members of the public calling for a complete stop to all AI development, a striking moment that reflected the rising public anxiety around frontier models and the urgency of building transparent, inclusive conversations about AI governance.

A Motivated and Collaborative Community

Throughout the week, participants remarked on the depth of discussion, the spirit of collaboration, and the openness with which researchers challenged assumptions and shared work-in-progress ideas. It was a powerful reminder that advancing AI safety is not only a technical challenge—it is a shared societal commitment.

We extend our deepest thanks to the speakers, participants, and organizers, including Siva Reddy, Yoshua Bengio, and Umesh Vazirani, for their leadership in shaping this event. A very special thank you as well to the Simons Institute staff and leadership for flawlessly hosting the workshop and creating such an inspiring, collaborative environment for everyone involved.

Continue the Conversation: Watch the Talks.

Recordings of the workshop sessions are available online:

Watch the Safety-Guaranteed LLMs Talks.

What’s Next

IVADO is proud to support the next generation of researchers working at the frontier of safe and responsible AI. We look forward to seeing how the insights from this workshop will continue to shape research, policy, and innovation in the years ahead.

Building on this momentum, IVADO will launch a new Thematic Semester this fall: Autonomous LLM Agents: Risks and Scientific Challenges. This program will continue to explore critical questions around the capabilities and safety of next-generation AI agents. Researchers interested in participating can find more details and registration information here:

Explore the Autonomous LLM Agents Semester.