Speech Generation and Sound Understanding in the Era of Large Language Models
David Harwath
University of Texas, Austin
May 1, 2025, 4:00–5:00 PM (Eastern Time)
Abstract: Transformer-based large language models (LLMs) have rapidly risen to dominance in the NLP field. One of the most exciting developments in this line of research is the finding that LLMs can be easily extended to handle multimodal inputs, such as vision or speech, via tokenization and concatenation with natural language inputs. In this talk, I will discuss several of my group's recent research directions aimed at expanding the capabilities of multimodal LLMs to process speech and spatial audio signals. In the first half of my talk, I will present my group's work on VoiceCraft, a neural codec language model that can perform targeted edits of speech recordings, in which words are arbitrarily inserted, deleted, or substituted in the waveform itself. These edits preserve the speaker's voice, prosody, and speaking style, while leaving the non-edited regions of the waveform completely intact. Subjective human evaluations indicate that the naturalness of the edited speech is approximately on par with that of the unedited speech, and when used for voice-cloning TTS, our model outperforms systems such as VALL-E and XTTS-v2. In the second half of my talk, I will discuss our recent work on spatial sound understanding. Sound event localization and detection is a classic task in the speech and audio community, involving predicting the class of a sound source as well as localizing it (e.g., predicting the direction of arrival). We extend this task to encompass higher-level reasoning about multiple sources within a physical environment by proposing the SpatialSoundQA dataset. This dataset contains over 800,000 ambisonic waveforms and accompanying question-answer pairs, and evaluates models on their ability to answer natural language questions such as "Is the sound of the telephone further to the left than the sound of the barking dog?" I will also describe our BAT model, an extension of the LLaMA LLM that is capable of taking spatial audio recordings as input and reasoning about them using natural language.

Bio: David Harwath is an assistant professor in the computer science department at UT Austin, where he leads the Speech, Audio, and Language Technologies (SALT) Lab. His group's research focuses on developing novel machine learning methods applied to speech, audio, and multimodal data for tasks such as automatic speech recognition, text-to-speech synthesis, and acoustic scene analysis. He has received the NSF CAREER award (2023), an ASRU best paper nomination (2015), and the 2018 George M. Sprowls Award for best computer science PhD thesis at MIT. He holds a B.S. in electrical engineering from UIUC (2010), an S.M. in computer science from MIT (2013), and a Ph.D. in computer science from MIT (2018).
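For readers unfamiliar with the "tokenize and concatenate" recipe mentioned in the abstract, the sketch below illustrates the general idea in Python/PyTorch: audio is quantized into discrete codec tokens, those tokens are embedded into the same vector space as text tokens, and the combined sequence is fed to a standard decoder-only LLM. This is not code from the talk or from VoiceCraft/BAT; all names, vocabulary sizes, and dimensions are illustrative assumptions.

```python
# Minimal sketch of the tokenize-and-concatenate idea for multimodal LLMs.
# Hypothetical sizes; real systems use a trained neural codec and a full transformer.
import torch
import torch.nn as nn

TEXT_VOCAB = 32_000   # assumed text vocabulary size
AUDIO_VOCAB = 1_024   # assumed neural-codec codebook size
D_MODEL = 512         # assumed model dimension

text_embed = nn.Embedding(TEXT_VOCAB, D_MODEL)
audio_embed = nn.Embedding(AUDIO_VOCAB, D_MODEL)  # codec tokens get their own table

def build_input(text_ids: torch.Tensor, audio_ids: torch.Tensor) -> torch.Tensor:
    """Embed text and audio token ids, then concatenate along the time axis."""
    text_vecs = text_embed(text_ids)      # (T_text, D_MODEL)
    audio_vecs = audio_embed(audio_ids)   # (T_audio, D_MODEL)
    # The resulting single sequence is what a decoder-only LLM would consume.
    return torch.cat([text_vecs, audio_vecs], dim=0)

# Example: a short "transcript + codec tokens" sequence.
text_ids = torch.randint(0, TEXT_VOCAB, (8,))
audio_ids = torch.randint(0, AUDIO_VOCAB, (20,))
seq = build_input(text_ids, audio_ids)
print(seq.shape)  # torch.Size([28, 512])
```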
TBD