Embodied Intelligence (EI) Joint Seminar Presentation

Speaker

Andrew Rouditchenko & Saurabhchand Bhati
MIT CSAIL

Host

Jim Glass
MIT CSAIL

There will be a joint presentation this week by two members of MIT CSAIL's Spoken Language Systems Group.

Title: Giving Sight to Speech Models

Abstract:  Most speech recognition models use only audio as input, which results in poor performance in noisy conditions. I will present Whisper-Flamingo, a multi-modal model that integrates lip-based visual features into the Whisper speech recognition model via gated cross-attention. Whisper-Flamingo outperforms audio-only Whisper on English speech recognition and English-X translation for 6 languages in noisy conditions. I will also present mWhisper-Flamingo, a multilingual extension trained on videos in 9 languages. It uses a novel decoder modality dropout technique, which is key to strong multilingual performance in noise.
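
To make the two ideas named above concrete, here is a minimal PyTorch sketch of a Flamingo-style gated cross-attention block combined with a simple decoder modality dropout. The layer sizes, parameter names, and dropout probability are illustrative assumptions, not the actual Whisper-Flamingo implementation.

import torch
import torch.nn as nn

class GatedCrossAttention(nn.Module):
    # One Flamingo-style block: decoder text tokens attend to visual (lip)
    # features, and a tanh gate initialized at zero controls how much of the
    # attended signal is added back, so training starts from the audio-only model.
    def __init__(self, dim=512, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, text_tokens, visual_feats, p_drop=0.3):
        # Decoder modality dropout (assumed form): during training, sometimes
        # skip the visual stream so the decoder stays robust without video.
        if self.training and torch.rand(1).item() < p_drop:
            return text_tokens
        attended, _ = self.attn(self.norm(text_tokens), visual_feats, visual_feats)
        return text_tokens + torch.tanh(self.gate) * attended

# Toy usage: batch of 2, 10 decoder tokens, 25 video frames, 512-dim features
block = GatedCrossAttention()
tokens = torch.randn(2, 10, 512)
lip_feats = torch.randn(2, 25, 512)
print(block(tokens, lip_feats).shape)  # torch.Size([2, 10, 512])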

Bio:  Andrew Rouditchenko is a PhD candidate at MIT CSAIL, working with Dr. Jim Glass in the Spoken Language Systems Group. His research interests are multi-modal and multilingual speech processing. He completed his MEng and SB at MIT.


Title:  Smaller, Stronger, and Duration-scalable Audio Learners

Abstract:  State-space models (SSMs) offer a computationally efficient alternative to Transformers for audio modeling, especially with long inputs. However, they face two main challenges: 1) SSMs underperform Transformer models such as the Audio Spectrogram Transformer (AST) on short, 10-second audio tagging tasks; 2) although audio SSMs theoretically support long audio inputs, their actual performance on long audio has not been thoroughly evaluated.

To address these issues, we introduce Knowledge Distilled Audio SSM (DASS), which leverages knowledge distillation during training. DASS is the first SSM to outperform Transformers on AudioSet, achieving an mAP of 48.9 while reducing the model size by one-third. Additionally, we designed a test called Audio Needle In A Haystack (Audio NIAH). DASS, trained on 10-second clips, successfully identifies sound events in hour-long recordings, while AST struggles with inputs of just 50 seconds. This demonstrates SSMs' superior scalability to longer durations.
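
As a rough illustration of the distillation objective mentioned above, the following is a minimal PyTorch sketch of a knowledge-distillation loss for multi-label audio tagging, with a Transformer teacher (e.g., AST) guiding an SSM student. The temperature, loss weighting, and toy tensors are assumptions for illustration, not the actual DASS training recipe.

import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, targets, T=2.0, alpha=0.5):
    # Soft targets from the Transformer teacher (temperature-scaled)
    soft_teacher = torch.sigmoid(teacher_logits / T)
    kd = F.binary_cross_entropy_with_logits(student_logits / T, soft_teacher)
    # Hard multi-label targets, as in AudioSet-style audio tagging
    ce = F.binary_cross_entropy_with_logits(student_logits, targets)
    return alpha * kd + (1.0 - alpha) * ce

# Toy usage with random logits for a 527-class AudioSet-style label space
student = torch.randn(4, 527)   # SSM student outputs
teacher = torch.randn(4, 527)   # Transformer teacher outputs
labels = torch.randint(0, 2, (4, 527)).float()
print(distillation_loss(student, teacher, labels))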

Bio:  Saurabhchand Bhati is a postdoctoral researcher at MIT CSAIL. His research interests are unsupervised spoken term discovery, unsupervised representation learning, and multimodal learning. Currently, he is exploring how large language models can improve low-resource speech systems. He received his PhD from Johns Hopkins University in 2023 under the guidance of Dr. Najim Dehak, and a B.Tech and M.Tech in Electrical Engineering from IIT Hyderabad, India.