Composing Foundation Models for Decision Making

Speaker

Anurag Ajay

MIT CSAIL

Host

Pulkit Agrawal

MIT CSAIL

Abstract: Recent advances in conditional generative modeling have enabled models like DALL-E and GPT-4 to generate high-resolution images and coherent text from brief prompts. However, developing a foundation model for decision-making is hindered by the scarcity and expense of collecting paired visual, language, and action data. To address this challenge, this thesis proposes a scalable alternative: a compositional model architecture that leverages separately trained expert models specializing in language, vision, and action. Because it reduces the need for extensive paired data collection, this approach mitigates the data scarcity problem while remaining efficient at solving novel decision-making tasks. Our compositional foundation model employs a large language model for task planning, a video diffusion model to generate detailed video trajectories, and an inverse dynamics model to map videos into actions. We demonstrate the effectiveness of this approach on tabletop manipulation tasks. Furthermore, as foundation models are applied across various embodied agents, there is a growing need to systematically evaluate these models' "common sense" understanding of the world. This evaluation is crucial for the successful deployment of embodied agents in real-world scenarios. To address this need, we introduce the first open-vocabulary benchmark for Embodied Question Answering (EQA), which assesses foundation models' ability to comprehend and reason about the world. In summary, by addressing data scarcity in developing foundation models for decision-making and establishing a benchmark for evaluating the reasoning capabilities of embodied agents, this thesis aims to advance the development of foundation models for decision-making.

Bio: Anurag Ajay is a Ph.D. student advised by Prof. Pulkit Agrawal in the Improbable AI Lab at MIT CSAIL. His research focuses on offline RL, generative modeling, and foundation models for decision-making. His papers have been selected for oral presentations at ICLR and ICML and have received the best cognitive robotics paper award at IROS.

Thesis committee: Pulkit Agrawal, Leslie Kaelbling, Dale Schuurmans

Date and Time

June 21, 2024, 1:00 PM to 2:00 PM (America/New_York)

Location

32-D463

Organizer & Contact

Anurag Ajay

aajay@mit.edu
