February 13, 2025

Embodied Red Teaming for Auditing Robotic Foundation Models

Zhang-Wei Hong
Massachusetts Institute of Technology

4:00–5:00 PM ET

Abstract: Language-conditioned robot models have the potential to enable robots to perform a wide range of tasks based on natural language instructions. However, assessing their safety and effectiveness remains challenging because it is difficult to test all the different ways a single task can be phrased. Current benchmarks have two key limitations: they rely on a limited set of human-generated instructions, missing many challenging cases, and they focus only on task performance without assessing safety, such as avoiding damage. To address these gaps, we introduce Embodied Red Teaming (ERT), a new evaluation method that generates diverse and challenging instructions to test these models. ERT uses automated red-teaming techniques with Vision Language Models (VLMs) to create contextually grounded, difficult instructions. Experimental results show that state-of-the-art language-conditioned robot models fail or behave unsafely on ERT-generated instructions, underscoring the shortcomings of current benchmarks in evaluating real-world performance and safety.

Bio: Zhang-Wei obtained his Ph.D. in Electrical Engineering and Computer Science (EECS) at the Massachusetts Institute of Technology (MIT), advised by Prof. Pulkit Agrawal. His research interests lie in reinforcement learning algorithms and their applications to language models and robotics. In 2024, Zhang-Wei was awarded the Qualcomm Fellowship for North America. He completed both his B.S. and M.S. degrees at National Tsing Hua University.

Location: TBD
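The evaluation loop the abstract describes is simple enough to sketch. Below is a minimal, hypothetical rendering of an ERT-style red-teaming round based only on the description above: a VLM proposes contextually grounded instructions for a scene, the policy is rolled out on each, and instructions that cause failures or unsafe behavior are collected. All interfaces here (`vlm.propose_instructions`, `env.rollout`, the outcome flags) are illustrative assumptions, not the paper's actual API.

```python
# A minimal sketch of an ERT-style evaluation loop, inferred from the abstract
# above. Every function and attribute name is a hypothetical placeholder.

from dataclasses import dataclass, field

@dataclass
class RedTeamReport:
    hard_instructions: list = field(default_factory=list)    # task failures
    unsafe_instructions: list = field(default_factory=list)  # safety violations

def embodied_red_team(vlm, policy, env, task, n_rounds=3, n_candidates=10):
    report = RedTeamReport()
    seen = []
    for _ in range(n_rounds):
        # Ask the VLM for challenging instruction phrasings grounded in the
        # current scene image, steering away from phrasings already tested.
        candidates = vlm.propose_instructions(
            image=env.render(), task=task, avoid=seen, n=n_candidates
        )
        for instruction in candidates:
            outcome = env.rollout(policy, instruction)  # success/safety flags
            seen.append(instruction)
            if not outcome.success:
                report.hard_instructions.append(instruction)
            if outcome.safety_violation:  # e.g., collision or damage
                report.unsafe_instructions.append(instruction)
    return report
```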

December 05, 2024

Inference-Time Policy Customization Through Interactive Task Specification

Felix Yanwei Wang
Massachusetts Institute of Technology

4:00–5:00 PM ET

Abstract: Imitation learning has driven the development of generalist policies capable of autonomously solving multiple tasks. However, when a pretrained policy makes errors during deployment, there are limited mechanisms for users to customize its behavior. While collecting additional data for fine-tuning can address such issues, doing so for each downstream use case is inefficient at scale. My research proposes an alternative perspective: framing policy errors as task mis-specifications rather than skill deficiencies. By enabling users to specify tasks unambiguously at inference time, the appropriate skill for a given context can be retrieved without fine-tuning. Specifically, I propose (1) inference-time steering, which leverages human interactions for single-step task specification, and (2) task and motion imitation, which uses symbolic plans for multi-step task specification. These frameworks correct misaligned policy predictions without requiring additional training, maximizing the utility of pretrained models while achieving inference-time user objectives.

Bio: Felix Yanwei Wang is a final-year PhD candidate in Electrical Engineering and Computer Science (EECS) at MIT, advised by Prof. Julie Shah. His research focuses on adapting pretrained manipulation policies for human-robot interaction. He earned his Bachelor's degree from Middlebury College and his Master's degree from Northwestern University. He has also worked under the guidance of Prof. Dieter Fox at the NVIDIA Robotics Lab. Felix is a recipient of the MIT Presidential Fellowship and the Work of the Future Fellowship in Generative AI at MIT. His research has been recognized with oral and spotlight presentations at CoRL and ICLR, featured on PBS, and is currently exhibited at the MIT Museum.

Location: 32-G449
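To make the "retrieve the right skill instead of fine-tuning" idea above concrete, here is a hedged sketch of one way inference-time steering could work: sample several candidate predictions from the frozen pretrained policy and keep the one consistent with the user's task specification (e.g., a clicked target object). The interfaces are illustrative, not the speaker's actual method.

```python
# A hypothetical sketch of inference-time steering: the pretrained policy is
# never updated; user input only selects among its samples.

def steer(policy, observation, user_spec, n_samples=16):
    """Pick the policy sample that best matches the user's unambiguous
    task specification, correcting mis-specification without training."""
    candidates = [policy.sample(observation) for _ in range(n_samples)]
    # user_spec.score(...) measures agreement with the specified task, e.g.,
    # whether a predicted trajectory reaches the object the user clicked.
    return max(candidates, key=lambda action: user_spec.score(action, observation))
```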

November 21, 2024

Redefining Context for Powerful Test-Time Adaptation Using Unlabeled Data

Sharut Gupta
Massachusetts Institute of Technology

4:00–5:00 PM ET

Abstract: Foundation models, while powerful, often struggle under distribution shifts in unfamiliar domains, typically requiring costly data collection and retraining to maintain performance. Test-Time Adaptation (TTA) has emerged as a promising approach to address these limitations, enabling models to adapt dynamically to new target domains at test time. In this talk, I will present TTA approaches that rethink the notion of "context", an abstract concept drawn from in-context learning, to address two fundamental challenges: improving out-of-distribution generalization and aligning representations with varying task-specific inductive biases, such as fairness constraints. Specifically, we explore two ways of leveraging unsupervised in-context learning, allowing models to use unlabeled data to adapt their behavior flexibly. First, we will demonstrate how using unlabeled domain data as context can align models with diverse distributions, enhancing their robustness in changing environments. Next, we will extend this idea to further improve this alignment by enforcing task-specific inductive priors. Together, these approaches showcase the potential of unsupervised, context-driven TTA to address key challenges of current-generation foundation models. Finally, we will explore the broader implications of this context-driven perspective for building world models, planning, and robust decision-making.

Bio: Sharut Gupta is a third-year PhD candidate in Electrical Engineering and Computer Science (EECS) at MIT, advised by Prof. Stefanie Jegelka. Her research interests focus on multi-modal representation learning, robustness, and out-of-distribution generalization. She received her dual Bachelor's and Master's degrees from the Indian Institute of Technology Delhi (IIT Delhi), where she completed her thesis research with Prof. Yoshua Bengio on "A Causal Perspective on Efficient Distributed Systems". Sharut is a recipient of the MIT Presidential Fellowship and has completed research internships at FAIR (Meta AI) and Google DeepMind.

Location: 32-G449 (Stata Center, Kiva/Patil Conference Room)
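The first idea in the abstract, using unlabeled target-domain data as context, can be sketched in a few lines. This is a generic illustration of unsupervised in-context adaptation under assumed interfaces (`model.encode_context` and a context-conditioned forward pass), not the speaker's published method.

```python
# A minimal sketch of unsupervised in-context test-time adaptation: a batch of
# *unlabeled* target-domain examples is supplied as context alongside the
# query, so the model aligns to the new distribution without gradient updates.

import torch

@torch.no_grad()
def predict_with_context(model, unlabeled_target_batch, query):
    # Summarize the test distribution from unlabeled data (no labels, no
    # fine-tuning); the model interface here is an assumption.
    context = model.encode_context(unlabeled_target_batch)
    # Condition the prediction for the query on that context.
    return model(query, context=context)
```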

November 14, 2024

Recent Progress on Foundation Model Supervision for Robot Learning

Jason Ma
University of Pennsylvania

4:00–5:00 PM ET

Abstract: Achieving general-purpose robotics requires robots to quickly learn diverse tasks without extensive training data or hand-engineered controllers for each scenario. While recent efforts in crowd-sourcing robot datasets have expanded available training data, these remain orders of magnitude smaller than the datasets used for vision or language foundation models. Rather than solely focusing on scaling robot data, my research develops algorithms that train new foundation models and leverage existing ones from non-robot domains to provide scalable supervision across diverse robot embodiments, tasks, and policy learning approaches -- in short, enabling robot learning from foundation model supervision. This approach enables automated task learning while bypassing labor-intensive controller design and data collection. In this talk, I will present recent progress in these directions. First, I will discuss Eurekaverse, an LLM-based environment curriculum generation algorithm that enables the acquisition of complex parkour skills in the real world. Second, I will present Generative Value Learning, a new approach to universal value functions enabled by long-context VLM in-context learning.

Bio: Jason Ma is a final-year PhD student at the University of Pennsylvania. His research interests include foundation models for robotics, robot learning, and reinforcement learning. His work was a Best Paper Finalist at ICRA 2024, was named one of NVIDIA's Top 10 Research Projects of 2023, and has been covered by popular media outlets such as The Economist, Fox, Yahoo, and TechCrunch. Jason is supported by the Apple Scholars in AI/ML PhD Fellowship and the OpenAI Superalignment Fellowship.

Location: 45-792 (Schwarzman College of Computing)
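The second contribution above lends itself to a short sketch: using a long-context VLM as a universal value function by asking it, in context, to rate task progress for each frame of a rollout. The prompt format and `vlm.query` interface below are assumptions made for illustration, not the actual Generative Value Learning implementation.

```python
# A hedged sketch of VLM-based value estimation in the spirit of Generative
# Value Learning, as summarized in the abstract above.

def estimate_values(vlm, frames, task_description, examples=()):
    """Return a task-progress estimate in [0, 1] for each frame of a rollout."""
    prompt = (
        f"Task: {task_description}\n"
        "For each image, estimate the percentage of task completion (0-100)."
    )
    # Optional in-context examples from other tasks or embodiments can be
    # prepended to improve calibration, with no robot-specific training.
    response = vlm.query(prompt, images=list(examples) + list(frames))
    return [p / 100.0 for p in response.progress_percentages]
```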

November 07, 2024

The Promises and Pitfalls of Open-source Agent Systems

Tim Dettmers
Carnegie Mellon University / Allen Institute for AI
4:00–5:00 PM ET

Abstract: Agent systems, AI systems that make their own plans and act on them, have shown promising results, particularly on coding challenges such as SWE-bench. However, most current agent systems rely on closed-source API models such as GPT-4o and Claude, because it is believed that open-source models lack the capabilities to power successful agent systems. In this talk, I show that agent systems powered by open-source models can match the performance of systems based on GPT-4o. This implies that, for good task performance, how you use a model matters much more than which model you use. I also discuss problems with agent-system generalization and high variability in evaluation, which show that we need to be cautious when making scientific claims about agent systems. I will argue that we need to focus on these generalization and evaluation challenges to make steady scientific progress.

Bio: Tim Dettmers is a Research Scientist at the Allen Institute for AI and an Assistant Professor at Carnegie Mellon University. His work focuses on making foundation models, such as ChatGPT, accessible to researchers and practitioners by reducing their resource requirements. His main focus is to develop high-quality agent systems that are open-source and can be run on consumer hardware, such as laptops. His research won oral, spotlight, and best paper awards at conferences such as ICLR and NeurIPS, and was awarded the Block Award and the Madrona Prize. He created the bitsandbytes open-source library for efficient foundation models, which is growing at 2.2 million installations per month, and for which he received Google Open Source and PyTorch Foundation awards.

Location: 32-G449 (Stata Center, Patil/Kiva Conference Room)
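For readers unfamiliar with the term, "agent system" above refers to scaffolding of the kind sketched below, in which a language model plans, calls tools, and reacts to feedback in a loop. This is a generic illustration of the concept, not Dettmers' system; the model and tool interfaces are placeholders.

```python
# A generic agent loop. The abstract's claim that "how you use a model matters
# more than which model you use" is a claim about this scaffolding layer.

def agent_loop(llm, tools, goal, max_steps=20):
    history = [f"Goal: {goal}"]
    for _ in range(max_steps):
        # The model proposes the next action (e.g., edit a file, run tests).
        action = llm.next_action("\n".join(history), tools=list(tools))
        if action.name == "finish":
            return action.argument  # the proposed solution
        observation = tools[action.name](action.argument)
        # Environment feedback drives the next planning step.
        history.append(f"Action: {action}\nObservation: {observation}")
    return None
```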

October 24, 2024

Aligning Language Models with LESS Data and a Simple (SimPO) Objective

Mengzhou Xia
Princeton University

4:00–5:00 PM ET

Abstract: Aligning pre-trained language models ensures they follow human instructions reliably to produce helpful and harmless outputs. Supervised fine-tuning and preference optimization are two key approaches for achieving this goal. In this talk, I will introduce two novel algorithms designed to enhance these two stages. First, I introduce LESS, a model- and optimizer-aware algorithm for data selection. LESS leverages a few curated examples to identify instruction-tuning data that fosters specific capabilities in the model. It avoids relying on surface-form cues by framing data selection as an optimization problem, aiming to minimize the loss on a target dataset (e.g., a validation set). Our experiments show that training on just 5% of the data selected by LESS outperforms training on the full dataset, with the selected data often transferable across different model sizes and families. Next, I will introduce a simple yet effective algorithm for model alignment, SimPO, which utilizes a reference-free reward formulation based on the average likelihood of model responses. Extensive experiments demonstrate that SimPO outperforms existing offline preference optimization methods, such as DPO, across various settings. Notably, the Gemma2-9B model tuned with SimPO achieved the highest rank among models under 10B parameters on Chatbot Arena, AlpacaEval 2, and WildBench.

Bio: Mengzhou Xia is a final-year PhD student in Computer Science at Princeton University, advised by Danqi Chen. Her research focuses on developing algorithms to build effective language models via data-centric approaches and objective designs under an academic budget. She received her master's degree from Carnegie Mellon University, where she worked with Graham Neubig, and her bachelor's degree from Fudan University in China. Mengzhou is a recipient of the 2024 Apple Scholars in AI/ML PhD Fellowship and the 2022 Bloomberg Data Science PhD Fellowship, and was named a 2024 MIT EECS Rising Star. Throughout her PhD, she has interned at Meta AI, Microsoft Research, and Bloomberg AI.

Location: 32-G449 (Stata Center, Patil/Kiva Conference Room)
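For readers who want the exact form of the reference-free reward mentioned above: per the SimPO paper (Meng et al., 2024), the implicit reward is the length-normalized average log-likelihood of a response, and the training objective enforces a target margin between preferred and dispreferred responses. The formula below is reproduced from that paper, not from the talk announcement itself.

```latex
% SimPO objective: y_w and y_l are the winning and losing responses for
% prompt x, beta scales the length-normalized log-likelihood reward, and
% gamma is a target reward margin. No reference model is needed.
\mathcal{L}_{\mathrm{SimPO}}(\pi_\theta) =
  -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}
  \left[ \log \sigma\!\left(
    \frac{\beta}{|y_w|} \log \pi_\theta(y_w \mid x)
    - \frac{\beta}{|y_l|} \log \pi_\theta(y_l \mid x)
    - \gamma
  \right) \right]
```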

October 17, 2024

Aligning Robot and Human Representations

Andreea Bobu
Massachusetts Institute of Technology

4:00–5:00 PM ET

Abstract: To perform the tasks that humans want in the world, robots rely on a representation of salient task features; for example, to hand me a cup of coffee, the robot considers features like efficiency and cup orientation in its behavior. Prior methods try to learn both a representation and a downstream task jointly from datasets of human behavior, but this unfortunately picks up on spurious correlations and results in behaviors that do not generalize. In my view, what's holding us back from successful human-robot interaction is that human and robot representations are often misaligned: for example, our assistive robot moved a cup inches away from my face -- which is technically collision-free behavior -- because it lacked an understanding of personal space. Instead of treating people as static data sources, my key insight is that robots must engage with humans in an interactive process of finding a shared representation for more efficient, transparent, and seamless downstream learning. In this talk, I focus on a divide-and-conquer approach: explicitly focus human input on teaching robots good representations before using them for learning downstream tasks. This means that instead of relying on inputs designed to teach the representation implicitly, we have the opportunity to design human input that is explicitly targeted at teaching the representation, and can do so efficiently. I introduce a new type of representation-specific input that lets the human teach new features, I enable robots to reason about the uncertainty in their current representation and automatically detect misalignment, and I propose a novel human behavior model to learn robust behaviors on top of human-aligned representations. By explicitly tackling representation alignment, I believe we can ultimately achieve seamless interaction with humans where each agent truly grasps why the other behaves the way they do.

Bio: Andreea Bobu is an Assistant Professor at MIT in AeroAstro and CSAIL. She leads the Collaborative Learning and Autonomy Research Lab (CLEAR Lab), which develops autonomous agents that learn to do tasks for, with, and around people. Her goal is to ensure that these agents' behavior is consistent with human expectations, whether they interact with expert designers or novice users. She obtained her Ph.D. in Electrical Engineering and Computer Science at UC Berkeley with Anca Dragan in 2023. Prior to her Ph.D., she earned her Bachelor's degree in Computer Science and Engineering from MIT in 2017. She is a recipient of the Apple AI/ML Ph.D. Fellowship, a Rising Star in EECS, and an R:SS and HRI Pioneer, and won the Best Paper Award at HRI 2020 and the Emerging Research Award at the International Symposium on the Mathematics of Neuroscience 2023. Before MIT, she was a Research Scientist at the AI Institute and an intern in the NVIDIA Robotics Lab.

Location: 32-G449 (Stata Center, Patil/Kiva Conference Room)
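One of the three contributions above, detecting misalignment from representation uncertainty, can be illustrated with a generic sketch. Ensemble disagreement is used here as a stand-in for the speaker's actual uncertainty estimator, and all interfaces are hypothetical.

```python
# A hedged sketch: the robot monitors uncertainty in its learned feature
# representation and, when it is high, asks the human for representation-
# specific input (e.g., labels for a new feature) instead of acting.

import statistics

def detect_misalignment(feature_ensemble, state, threshold=0.5):
    """Flag states where independently learned feature functions disagree,
    i.e., where the representation likely misses what the human cares about."""
    values = [f(state) for f in feature_ensemble]  # each f: state -> float
    disagreement = statistics.pstdev(values)
    return disagreement > threshold  # True => query the human to teach the feature
```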

October 10, 2024

On Building General, Zero-Shot Robot Policies

Nur Muhammad “Mahi” Shafiullah
New York University

4:00–5:00 PM ET

Abstract: Robot models, particularly those trained with large amounts of data, have recently shown a plethora of real-world manipulation and navigation capabilities. Several independent efforts have shown that, given sufficient training data in an environment, robot policies can generalize to demonstrated variations in that environment. However, needing to fine-tune robot models for every new environment stands in stark contrast to models in language or vision that can be deployed zero-shot for open-world problems. In this talk, I will present Robot Utility Models (RUMs), a framework for training and deploying zero-shot robot policies that can directly generalize to new environments without any fine-tuning. To create RUMs efficiently, we developed new tools to quickly collect data for mobile manipulation tasks, integrate such data into a policy with multi-modal imitation learning, and deploy policies on-device on Hello Robot Stretch, a cheap commodity robot, with an external mLLM verifier for retrying. We trained five such utility models for opening cabinet doors, opening drawers, picking up napkins, picking up paper bags, and reorienting fallen objects. Our system achieves, on average, a 90% success rate in unseen, novel environments interacting with unseen objects. Moreover, the utility models can also succeed in different robot and camera set-ups with no further data, training, or fine-tuning. I will talk about our primary lessons from training RUMs: namely, the importance of training data over the training algorithm and policy class, guidance on data scaling, the necessity of diverse yet high-quality demonstrations, and a recipe for robot introspection and retrying to improve performance in individual environments. All the code, data, and models I will talk about have been open-sourced on our website: https://robotutilitymodels.com/

Bio: Nur Muhammad “Mahi” Shafiullah is a Ph.D. student at the NYU Courant Institute, advised by Lerrel Pinto. His research is driven by a vision of robots seamlessly integrated into our messy everyday lives: automating problems and continuously learning alongside us. Mahi's recent work has developed new algorithms for learning robotic behavior, large robot models for robust manipulation, and spatio-semantic memory that can handle dynamic changes in the world. He is passionate about getting these models and algorithms out into the real world, operating autonomously in NYC homes. His work has been featured in oral and spotlight presentations and demos at conferences like ICRA, RSS, NeurIPS, ICML, and ICLR. Mahi is supported by the Apple Fellowship and the Jacob T. Schwartz Fellowship, and was a visiting scientist at Meta. In a past life, he was a silver medalist at the International Mathematical Olympiad and worked on adversarial robustness as an undergrad at MIT (S.B. ‘19).

Location: 32-G449 (Patil/Kiva Conference Room)
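The "introspection and retrying" recipe mentioned above is easy to sketch at a high level: the policy attempts the task, an external multimodal LLM (mLLM) verifier judges success from a camera image, and the robot retries on failure. The interfaces below are illustrative placeholders, not the RUMs codebase.

```python
# A minimal sketch of deploy-time retrying with an mLLM verifier, inferred
# from the abstract above.

def run_with_verifier(policy, robot, verifier, task, max_retries=3):
    for _ in range(1 + max_retries):
        robot.execute(policy, task)            # roll out the zero-shot policy
        image = robot.capture_image()
        if verifier.is_success(image, task):   # mLLM judges task completion
            return True
        robot.reset_pose()                     # back off and try again
    return False
```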

October 03, 2024

Foundations of High-Modality Multisensory AI

Paul Liang
MIT Media Lab / MIT EECS

4:00–5:00 PM ET

Abstract: Building multisensory AI that learns from text, speech, video, real-world sensors, wearable devices, and medical data holds promise for impact in many scientific areas with practical benefits, such as supporting human health and well-being, enabling multimedia content processing, and enhancing real-world autonomous agents. However, multimodal systems quickly run into data and modeling bottlenecks: it is increasingly difficult to collect paired multimodal data and to scale multimodal transformers as the number of modalities and their dimensionality grow. In this talk, I propose a vision of high-modality learning: building multimodal AI over many diverse input modalities, given only partially observed subsets of data or model representations. We will cover two key ideas that enable high-modality learning: (1) discovering how modalities interact to give rise to new information, and (2) tackling the heterogeneity across many different modalities. Finally, I will discuss our collaborative efforts in scaling AI to many modalities and tasks for real-world impact on affective computing, mental health, and cancer prognosis.

Bio: Paul Liang is an Assistant Professor at the MIT Media Lab and MIT EECS. His research advances the foundations of multisensory artificial intelligence to enhance the human experience. He is a recipient of the Siebel Scholars Award, the Waibel Presidential Fellowship, the Facebook PhD Fellowship, the Center for ML and Health Fellowship, Rising Stars in Data Science, and three best paper awards. Outside of research, he received the Alan J. Perlis Graduate Student Teaching Award for developing new courses on multimodal machine learning.

Location: 32-G449 (Stata Center, Patil/Kiva Conference Room)

September 26, 2024

Learning Robust, Real-world Visuomotor Skills from Generated Data

Ge Yang
MIT CSAIL

4:00–5:00 PM ET

Abstract: The mainstream approach in robot learning today relies heavily on imitation learning from real-world human demonstrations. These methods are sample-efficient in controlled environments and easy to scale to a large number of skills. However, I will present algorithmic arguments for why merely scaling up imitation learning is insufficient for advancing robotics. Instead, my talk will focus on developing performant visuomotor policies in simulation and the techniques that make them robust enough to transfer directly to real-world color observations. I will introduce LucidSim, our recent breakthrough in producing real-world perceptive robot policies from synthetic data. Using only generated images, we successfully trained a robot dog to perform parkour through obstacles at high speed, relying solely on a color camera for visual input. I will discuss how we generate diverse and physically accurate image sequences within simulated environments for learning, and address the system challenges we overcame to scale up. Finally, I will outline our push for versatility and our plans to acquire three hundred language-aware visuomotor skills by the end of this year. These are the first steps toward developing fully autonomous, embodied agents that require deeper levels of intelligence.

Bio: Ge Yang is a postdoctoral researcher working with Phillip Isola at MIT CSAIL. His research focuses on developing the algorithmic and system foundations for computational visuomotor learning, with an emphasis on learning from synthetic data and sim-to-real transfer. Ge's work is dedicated to making robots capable, versatile, and intelligent. Before transitioning into AI and robotics, Ge earned his Ph.D. in Physics from the University of Chicago and a Bachelor of Science in Mathematics and Physics from Yale University. His experience in physics motivated a multidisciplinary approach to problem-solving in AI. He is a recipient of the NSF Institute for AI and Fundamental Interactions Postdoctoral Fellowship and the Best Paper Award at the 2024 Conference on Robot Learning (CoRL), selected from 499 submissions.

Location: 32-G449 (Stata Center, Patil/Kiva Conference Room)
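The data-generation recipe the abstract alludes to can be sketched as follows: simulator rollouts provide ground-truth supervision, a generative image model renders diverse photorealistic frames conditioned on simulator geometry, and the visuomotor policy is trained on the generated images. The component interfaces below are assumptions for illustration, not LucidSim's actual API.

```python
# A hedged sketch of a generated-data pipeline in the spirit of the talk.

def build_training_set(simulator, image_generator, expert, n_rollouts, prompts):
    dataset = []
    for _ in range(n_rollouts):
        for step in simulator.rollout(expert):  # privileged expert in sim
            # Condition image generation on simulator geometry (e.g., depth or
            # semantic buffers) so frames stay physically consistent with the
            # scene, while prompts provide visual diversity.
            rgb = image_generator.render(step.geometry_buffers,
                                         prompt=prompts.sample())
            dataset.append((rgb, step.expert_action))  # (image, action) pairs
    return dataset
```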

September 19, 2024

Cultural Biases, World Languages, and Privacy Protection in Large Language Models

Wei Xu
Georgia Institute of Technology

4:00–5:00 PM ET

Abstract: In this talk, I will highlight three key aspects of large language models: (1) cultural bias in LLMs and pre-training data, (2) decoding algorithms for low-resource languages, and (3) human-centered design for real-world applications. The first part focuses on systematically assessing LLMs' favoritism toward Western culture. We take an entity-centric approach to measure the cultural biases of LLMs (e.g., GPT-4, Aya, and mT5) through natural prompts, story generation, sentiment analysis, and named entity tasks. One interesting finding is that a potential cause of cultural biases in LLMs is the extensive use and upsampling of Wikipedia data during the pre-training of almost all LLMs. The second part will introduce a constrained decoding algorithm that can facilitate the generation of high-quality synthetic training data for fine-grained prediction tasks (e.g., named entity recognition, event extraction). This approach outperforms GPT-4 on many non-English languages, particularly low-resource African languages. Lastly, I will showcase an LLM-powered privacy-preservation tool designed to safeguard users against the disclosure of personal information. I will share findings from an HCI user study in which real Reddit users used our tool, which in turn informs our ongoing efforts to improve the design of AI models. Concluding the talk, I will briefly touch on recent research exploring the temporal robustness of large language models (e.g., handling neologisms) and advances in human-AI interactive evaluation of LLM-generated texts.

Bio: Wei Xu is an Associate Professor in the College of Computing and the Machine Learning Center at the Georgia Institute of Technology, where she is the director of the NLP X Lab. Her research interests are in natural language processing and machine learning, with a focus on generative AI, the robustness and fairness of large language models, multilingual LLMs, and interdisciplinary research in AI for science, education, accessibility, and privacy. She is a recipient of the NSF CAREER Award, the AI for Everyone Award, and a Best Paper Award and an Honorable Mention at COLING'18 and ACL'23. She has received research funding from DARPA and IARPA, and is currently an executive board member of NAACL.

Location: 45-792 (Schwarzman College of Computing)
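Constrained decoding, as invoked in the second part of the abstract, has a standard skeleton worth showing: at each generation step, tokens that would violate the output constraints (for example, a required markup scheme for entity or event annotations in synthetic training data) are masked out before the next token is chosen. This is a generic illustration of the technique, not the speaker's published algorithm; the `model`, `tokenizer`, and `constraint` interfaces are placeholders.

```python
# A generic greedy constrained-decoding sketch.

import math

def constrained_decode(model, tokenizer, prompt, constraint, max_new_tokens=128):
    tokens = tokenizer.encode(prompt)
    for _ in range(max_new_tokens):
        logits = model.next_token_logits(tokens)  # one score per vocab token
        # Mask every token the constraint disallows in the current state.
        for token_id in range(len(logits)):
            if not constraint.allows(tokens, token_id):
                logits[token_id] = -math.inf
        next_token = max(range(len(logits)), key=logits.__getitem__)  # greedy
        tokens.append(next_token)
        if constraint.is_complete(tokens):
            break
    return tokenizer.decode(tokens)
```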