News
October 2, 2025
Agents that Code, Browse, and Act: Highlights from a Four-Day Bootcamp
By: Hager Radi Abdelwahed – Senior Applied Research Scientist at Mila – Quebec AI Institute
Given the rapid progress of AI agents, IVADO organized the bootcamp “Focusing on the Current State of Agents” (August 12–15, 2025) as part of its Thematic Semester on Autonomous LLM Agents: Risks and Scientific Challenges. This four-day bootcamp brought together academic and industrial researchers to explore the current state of agentic systems. Each day was devoted to one of four themes: 1) coding agents, 2) web agents, 3) robotics and embodiment, and 4) multi-agent interaction.
Day 1: Agents that Code
The first day focused on agents designed for coding, whether to generate or debug code. Daniel Fried gave a thorough overview of the state of the art in coding agents, demonstrating how LLMs can answer questions by generating code. He highlighted research like TroVE, which enables agents to add tools and functions dynamically. Fried also covered evaluation benchmarks such as RepoST, which involve mocking external APIs, synthesizing inputs and comparators, and unit-testing LLM-built tools against real code. Building on this, he presented SWE-RL, which applies reasoning-centric reinforcement learning (RL) to successfully perform real GitHub PR fixes.
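To make the TroVE-style idea of dynamically growing a tool library concrete, here is a minimal sketch. The `call_llm` helper, the prompt format, and the verification shortcut are our own illustrative assumptions, not the paper's implementation:

```python
def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for a real chat-completion client."""
    raise NotImplementedError("plug in your own model here")

TOOLBOX: dict[str, str] = {}  # tool name -> Python source of a reusable helper

def solve(task: str) -> str:
    # Show the agent its current tools so it can reuse them instead of rewriting.
    tools = "\n\n".join(TOOLBOX.values()) or "(no tools yet)"
    prompt = (
        f"Existing tools:\n{tools}\n\n"
        f"Task: {task}\n"
        "Either call an existing tool or define ONE new reusable function, "
        "then solve the task with it. Return only Python code."
    )
    code = call_llm(prompt)
    # A real system would verify candidates (tests, sandboxing, deduplication)
    # before keeping them; here any new `def` is recorded as a candidate tool.
    for line in code.splitlines():
        if line.startswith("def "):
            name = line.split("(")[0].removeprefix("def ").strip()
            TOOLBOX.setdefault(name, code)
    return code
```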
Expanding on the topic of benchmarks, Ofir Press detailed various efforts to create reliable evaluation metrics for coding agents. He mentioned SWE-bench, SWE-bench-live, and SWE-bench-mini as attempts to assess LLM agents on software engineering tasks such as bug fixing and GitHub issue resolution. He also shared a concise, roughly 100-line SWE agent.
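The appeal of such a tiny agent is that its core loop fits in a few lines: the model proposes one shell command per turn, and a harness executes it and feeds the output back. A hedged sketch of that loop, with a hypothetical `call_llm` client standing in for a real model API (the actual agent's code differs in its details):

```python
import subprocess

def call_llm(messages: list[dict]) -> str:
    """Hypothetical stand-in for a real chat-completion client."""
    raise NotImplementedError("plug in a real model here")

def run_agent(issue: str, max_steps: int = 20) -> None:
    messages = [
        {"role": "system", "content":
         "Fix the issue below. Reply with exactly one shell command per turn. "
         "Reply DONE when the fix is complete."},
        {"role": "user", "content": issue},
    ]
    for _ in range(max_steps):
        command = call_llm(messages).strip()
        if command == "DONE":
            break
        # Execute the proposed command and feed stdout/stderr back to the model.
        result = subprocess.run(command, shell=True, capture_output=True,
                                text=True, timeout=60)
        messages.append({"role": "assistant", "content": command})
        messages.append({"role": "user",
                         "content": result.stdout + result.stderr})
```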
The day concluded with a tutorial from Microsoft Research Montreal (Marc-Alexandre Côté and Alessandro Sordoni), which gave participants the opportunity to explore debug-gym, a tool that enables code debugging through text interactions with an agent.
Day 2: Web Agents at Scale
Siva Reddy opened with a discussion of the path to safe web agents, drawing a parallel between web agents and semantic parsing. He presented benchmarks such as WebArena, VisualWebArena, and WorkArena for evaluating LLM web agents. A key question raised was whether agents need new web interfaces, distinct from those designed for human interaction. He concluded that LLM alignment transfers poorly to web agents, that LLMs perform poorly as judges of web-agent behavior, and that stronger multi-modal LLMs tailored to web agents are needed.
Following this, Victor Zhong addressed the topic of generalist agents, highlighting the utility of benchmarks like OSWorld in assessing web agents within real-world computer use cases.
Subsequently, Xin Eric Wang argued that even a robust generalist agent can be a weak grounding model. He emphasized that the key lies in letting a generalist model handle semantic reasoning while distributing cognitive load to specialized models. He also challenged the prevailing reliance on web interfaces, suggesting that agents could instead interact with computers via the command line and code.
The day’s tutorial featured Alexandre Lacoste of ServiceNow, who walked through the process of building and evaluating autonomous web agents. He described ongoing efforts to bring evaluation methods together under a single, comprehensive meta-benchmark. Echoing earlier discussions, he reassured attendees that building web agents is straightforward given the availability of strong VLMs; evaluating them, however, is challenging, and fooling them is easy, which introduces a new landscape of security threats.
Day 3: Robotics, from Maps to Multimodal Policies
Liam Paull opened the day by outlining traditional approaches to robot operation, then explored how foundation-model representations can be integrated and the potential for creating generalist foundation models in robotics. He highlighted ongoing challenges for robotic generalist models, including data scarcity, common-sense reasoning, and robustness. Given that VLMs do not directly output control values, he discussed new work on Vision-Language-Action (VLA) models such as OpenVLA, π0.5, and Poutine.
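One way to see what a VLA adds over a plain VLM is the action head: many VLAs emit discrete "action tokens" that are decoded back into continuous controls. The sketch below shows only that decoding step; the bin count and action bounds are chosen for illustration and are not taken from any particular model (OpenVLA and π0.5 each define their own schemes):

```python
import numpy as np

N_BINS = 256                       # tokens per action dimension (assumption)
ACTION_LOW = np.array([-1.0] * 7)  # e.g., 6-DoF arm delta + gripper (assumption)
ACTION_HIGH = np.array([1.0] * 7)

def detokenize(action_tokens: np.ndarray) -> np.ndarray:
    """Map integer tokens in [0, N_BINS) to continuous control values."""
    fractions = (action_tokens + 0.5) / N_BINS           # bin centers in [0, 1]
    return ACTION_LOW + fractions * (ACTION_HIGH - ACTION_LOW)

# A VLA forward pass would produce one token per action dimension from
# (image, instruction); here we fake the model output to show the decoding.
tokens = np.array([128, 40, 200, 128, 128, 128, 255])
print(detokenize(tokens))  # the continuous command a robot controller consumes
```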
Joyce Chai continued the conversation, elaborating on the integration of VLMs and VLAs into robots. She emphasized the diverse applications of LLMs in cognitive robots, including grounding, manipulation, and navigation. Furthermore, she discussed how language can function as a reasoning tool for robots, extending beyond its role as a communication channel. She also explored the potential of pairing LLMs with simulation for data synthesis, which can then be used to pretrain and fine-tune foundation models for robots.
Jana Pavlasek highlighted the need for solutions that scale to real robots operating in unstructured, real-world environments. She advocated for models that combine machine learning with probabilistic inference to provide effective inductive biases for robotics; one concrete instance of this recipe is sketched below.
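As one illustration of that recipe (our own example, not Pavlasek's system), consider a particle filter whose observation likelihood comes from a learned model while the filtering machinery supplies the probabilistic structure; the learned piece is stubbed here with a simple Gaussian:

```python
import numpy as np

rng = np.random.default_rng(0)

def learned_likelihood(particles: np.ndarray, observation: np.ndarray) -> np.ndarray:
    """Stand-in for a neural network scoring p(observation | state)."""
    # Purely for illustration: a Gaussian around the observation.
    d = np.linalg.norm(particles - observation, axis=1)
    return np.exp(-0.5 * (d / 0.5) ** 2)

def particle_filter_step(particles, observation, motion_noise=0.1):
    # Predict: a motion model (structured prior, not learned) diffuses the state.
    particles = particles + rng.normal(0, motion_noise, particles.shape)
    # Update: weight particles by the learned observation model.
    weights = learned_likelihood(particles, observation)
    weights /= weights.sum()
    # Resample: concentrate probability mass on plausible states.
    idx = rng.choice(len(particles), size=len(particles), p=weights)
    return particles[idx]

particles = rng.normal(0, 1, (500, 2))            # initial belief over a 2-D pose
particles = particle_filter_step(particles, np.array([0.3, -0.2]))
print(particles.mean(axis=0))                     # posterior mean estimate
```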
Glen Berseth concluded the day with a hands-on tutorial on building general robot policies with code. He stressed that supervised learning alone is insufficient for generalization and emphasized the importance of developing robots capable of acting on a wide range of human commands. He showcased recent work on large-scale datasets and foundation models for robots, such as Open X-Embodiment and PaLM-E.
While the day’s discussions might suggest that robots are close to fully taking over human tasks, the consensus was that they still have a considerable journey ahead.
Day 4: Multi-Agent Reasoning and Learning
Moving from single- to multi-agent systems, Alane Suhr presented the complexities of settings where agents must build world models from diverse perspectives, understand other agents’ goals, and even reason about their thinking (Theory of Mind). She discussed tools such as language use and interaction-dynamics modeling to address these challenges, emphasizing their importance for human-agent collaboration and citing Anthropic’s multi-agent research system. Suhr also cautioned that LLMs can fail to accurately model social dynamics and can replicate stereotypes in multi-agent simulations, questioning their ability to replace human-subjects research.
Next, Natasha Jaques discussed the application of multi-agent reinforcement learning (MARL) to large language models (LLMs), specifically how self-play can be leveraged to train safe LLMs. She explained that MARL can facilitate the joint training of a defender and an attacker, where the attacker is incentivized to generate more attacks, prompting the defender to react and reason through these challenges.
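The reward structure of that attacker/defender game can be summarized in a short sketch. This is a schematic under our own assumptions, with hypothetical stubs for the two policies and the safety judge, not the training code of any specific system:

```python
def attacker_generate(seed_prompt: str) -> str:
    raise NotImplementedError("attacker policy proposes an adversarial prompt")

def defender_respond(attack_prompt: str) -> str:
    raise NotImplementedError("defender policy answers the prompt")

def is_unsafe(response: str) -> bool:
    raise NotImplementedError("safety judge, e.g., a classifier or a rubric")

def self_play_step(seed_prompt: str) -> tuple[float, float]:
    attack = attacker_generate(seed_prompt)
    response = defender_respond(attack)
    # Roughly zero-sum: the attacker scores when it elicits unsafe output,
    # the defender scores when it stays safe. Real systems add terms for
    # attack diversity and defender helpfulness to avoid degenerate refusals.
    unsafe = is_unsafe(response)
    attacker_reward = 1.0 if unsafe else -1.0
    defender_reward = -attacker_reward
    return attacker_reward, defender_reward
```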
Yoav Artzi then presented how LLMs can learn from conversational interactions through retrospective learning. He demonstrated how LLMs can serve as feedback decoders, capable of extracting implicit conversational feedback, even in tasks where they initially perform poorly. This allows them to mine learning signals from past interactions and bootstrap their capabilities. He also introduced the concept of in-context RL (ICRL) for LLMs, which helps characterize in-context behavior and enables rapid steering of LLMs without explicit weight updates, supporting online, interactive, continual learning and ad-hoc conventions.
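A minimal sketch of the ICRL idea, assuming a hypothetical `call_llm` client and a simple (task, attempt, feedback) episode format of our own choosing: the frozen model is steered purely by accumulating past episodes in its prompt.

```python
def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for a real chat-completion client."""
    raise NotImplementedError("plug in a real model here")

def icrl_answer(task: str, history: list[tuple[str, str, str]]) -> str:
    """history holds (task, attempt, feedback) triples from past interactions."""
    context = "\n\n".join(
        f"Task: {t}\nAttempt: {a}\nFeedback: {f}" for t, a, f in history
    )
    prompt = (f"Past interactions:\n{context}\n\n"
              f"Using that feedback, solve the new task.\nTask: {task}")
    return call_llm(prompt)

# After each episode, append (task, attempt, decoded feedback) to `history`;
# the model's behavior shifts on the next call through its context alone,
# with no weight updates.
```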
Quentin Bertrand concluded the day with a tutorial in which participants coded multiple agents playing the iterated prisoner’s dilemma (a toy version appears below). He highlighted the difficulty of fostering cooperation among agents and noted that cooperation is not always desirable, as it can lead to collusion. Future directions involve modeling cooperation in more complex environments, with more players, more actions, and larger state spaces, in the hope that deep RL can meet these challenges.
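For readers who want to try the exercise themselves, here is a small, self-contained version of the game with two classic strategies under the standard payoff matrix (the tutorial's actual code likely differs):

```python
# (my_payoff, opponent_payoff) indexed by (my_move, opponent_move);
# "C" = cooperate, "D" = defect.
PAYOFFS = {
    ("C", "C"): (3, 3), ("C", "D"): (0, 5),
    ("D", "C"): (5, 0), ("D", "D"): (1, 1),
}

def tit_for_tat(history: list[tuple[str, str]]) -> str:
    """Cooperate first, then copy the opponent's previous move."""
    return "C" if not history else history[-1][1]

def always_defect(history: list[tuple[str, str]]) -> str:
    return "D"

def play(strategy_a, strategy_b, rounds: int = 10) -> tuple[int, int]:
    history_a: list[tuple[str, str]] = []  # (own move, opponent move)
    history_b: list[tuple[str, str]] = []
    score_a = score_b = 0
    for _ in range(rounds):
        move_a, move_b = strategy_a(history_a), strategy_b(history_b)
        pa, pb = PAYOFFS[(move_a, move_b)]
        score_a, score_b = score_a + pa, score_b + pb
        history_a.append((move_a, move_b))
        history_b.append((move_b, move_a))
    return score_a, score_b

print(play(tit_for_tat, always_defect))  # defection exploits early cooperation
```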
That’s a wrap! Up next, IVADO will host:
1st Workshop: Assessing and Improving the Capabilities and Safety of Agents, October 3–6, 2025.
2nd Workshop: Deploying Autonomous Agents: Lessons, Risks, and Real-World Impact, November 17–20, 2025.