Embodied Intelligence (EI) Joint Seminar Presentation

Speaker

Andrew Rouditchenko & Saurabhchand Bhati
MIT CSAIL

Host

Jim Glass
MIT CSAIL

There will be a joint presentation this week by two members of MIT CSAIL's Spoken Language Systems Group.

Title: Giving Sight to Speech Models

Abstract:  Most speech recognition models use only audio as input, which results in poor performance in noisy conditions. I will present Whisper-Flamingo, a multi-modal model that integrates lip-based visual features into the Whisper speech recognition model via gated cross-attention. Whisper-Flamingo outperforms audio-only Whisper on English speech recognition and English-X translation for 6 languages in noisy conditions. I will also present mWhisper-Flamingo, a multilingual extension trained on videos in 9 languages. It uses a novel decoder modality dropout technique, which is key to strong multilingual performance in noise.
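
To make the two ideas named above concrete, here is a minimal PyTorch sketch of a Flamingo-style gated cross-attention block combined with a simple decoder modality dropout. The layer sizes, parameter names, and dropout probability are illustrative assumptions, not the actual Whisper-Flamingo implementation.

import torch
import torch.nn as nn

class GatedCrossAttention(nn.Module):
    # One Flamingo-style block: decoder text tokens attend to visual (lip)
    # features, and a tanh gate initialized at zero controls how much of the
    # attended signal is added back, so training starts from the audio-only model.
    def __init__(self, dim=512, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, text_tokens, visual_feats, p_drop=0.3):
        # Decoder modality dropout (assumed form): during training, sometimes
        # skip the visual stream so the decoder stays robust without video.
        if self.training and torch.rand(1).item() < p_drop:
            return text_tokens
        attended, _ = self.attn(self.norm(text_tokens), visual_feats, visual_feats)
        return text_tokens + torch.tanh(self.gate) * attended

# Toy usage: batch of 2, 10 decoder tokens, 25 video frames, 512-dim features
block = GatedCrossAttention()
tokens = torch.randn(2, 10, 512)
lip_feats = torch.randn(2, 25, 512)
print(block(tokens, lip_feats).shape)  # torch.Size([2, 10, 512])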

Bio:  Andrew Rouditchenko is a PhD candidate at MIT CSAIL, working with Dr. Jim Glass in the Spoken Language Systems Group. His research interests are multi-modal and multilingual speech processing. He completed his MEng and SB at MIT.


Title:  Smaller, Stronger, and Duration-scalable Audio Learners

Abstract:  State-space models (SSMs) offer a computationally efficient alternative to Transformers for audio modeling, especially with long inputs. However, they face two main challenges: 1) SSMs underperform Transformer models such as the Audio Spectrogram Transformer (AST) on short, 10-second audio tagging tasks; 2) although audio SSMs theoretically support long audio inputs, their actual performance on long audio has not been thoroughly evaluated.

To address these issues, we introduce Knowledge Distilled Audio SSM (DASS), which leverages knowledge distillation during training. DASS is the first SSM to outperform Transformers on AudioSet, achieving an mAP of 48.9 while reducing the model size by one-third. Additionally, we designed a test called Audio Needle In A Haystack (Audio NIAH). DASS, trained on 10-second clips, successfully identifies sound events in hour-long recordings, while AST struggles with inputs of just 50 seconds. This demonstrates SSMs' superior scalability to longer durations.
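
As a rough illustration of the distillation objective mentioned above, the following is a minimal PyTorch sketch of a knowledge-distillation loss for multi-label audio tagging, with a Transformer teacher (e.g., AST) guiding an SSM student. The temperature, loss weighting, and toy tensors are assumptions for illustration, not the actual DASS training recipe.

import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, targets, T=2.0, alpha=0.5):
    # Soft targets from the Transformer teacher (temperature-scaled)
    soft_teacher = torch.sigmoid(teacher_logits / T)
    kd = F.binary_cross_entropy_with_logits(student_logits / T, soft_teacher)
    # Hard multi-label targets, as in AudioSet-style audio tagging
    ce = F.binary_cross_entropy_with_logits(student_logits, targets)
    return alpha * kd + (1.0 - alpha) * ce

# Toy usage with random logits for a 527-class AudioSet-style label space
student = torch.randn(4, 527)   # SSM student outputs
teacher = torch.randn(4, 527)   # Transformer teacher outputs
labels = torch.randint(0, 2, (4, 527)).float()
print(distillation_loss(student, teacher, labels))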

Bio:  Saurabhchand Bhati is a postdoctoral researcher at MIT CSAIL. His research interests are unsupervised spoken term discovery, unsupervised representation learning, and multimodal learning. Currently, he is exploring how large language models can improve low-resource speech systems. He received his PhD from Johns Hopkins University in 2023 under the guidance of Dr. Najim Dehak, and a B.Tech and M.Tech in Electrical Engineering from IIT Hyderabad, India.