Embodied Intelligence (EI) Joint Seminar Presentation
There will be a joint presentation this week by two MIT CSAIL members from the Spoken Language Systems group.
Title: Giving Sight to Speech Models
Abstract: Most speech recognition models use only audio as input, which leads to poor performance in noisy conditions. I will present Whisper-Flamingo, a multi-modal model that integrates lip-based visual features into the Whisper speech recognition model via gated cross attention. Whisper-Flamingo outperforms audio-only Whisper on English speech recognition and English-X translation for 6 languages in noisy conditions. I will also present mWhisper-Flamingo, a multilingual extension trained on videos in 9 languages; it uses a novel decoder modality dropout technique that is key to strong multilingual performance in noise.
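
The two mechanisms named in the abstract, gated cross attention over visual features and decoder modality dropout, can be pictured with a short PyTorch sketch. This is an illustrative sketch rather than the Whisper-Flamingo implementation; the module names, tensor shapes, and dropout rate below are assumptions.

# Illustrative sketch (not the authors' code) of gated cross attention and
# decoder modality dropout; names, shapes, and rates are assumptions.
import torch
import torch.nn as nn

class GatedCrossAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Gate starts at zero so the pretrained audio-only decoder path is
        # unchanged at the beginning of training (tanh(0) = 0).
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, text_states, visual_feats):
        # text_states: (batch, text_len, dim) decoder hidden states
        # visual_feats: (batch, video_len, dim) lip-based visual features
        attended, _ = self.attn(text_states, visual_feats, visual_feats)
        return text_states + torch.tanh(self.gate) * attended

def modality_dropout(visual_feats, p: float = 0.5, training: bool = True):
    # With probability p, withhold the visual stream for an entire example
    # so the decoder also learns to transcribe from audio alone.
    if not training:
        return visual_feats
    keep = (torch.rand(visual_feats.size(0), 1, 1,
                       device=visual_feats.device) > p).float()
    return visual_feats * keep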
Bio: Andrew Rouditchenko is a PhD candidate at MIT CSAIL, working with Dr. Jim Glass in the Spoken Language Systems Group. His research interests are multi-modal and multilingual speech processing. He completed his MEng and SB at MIT.
Title: Smaller, Stronger, and Duration-scalable Audio Learners
Abstract: State-space models (SSMs) offer a computationally efficient alternative to Transformers for audio modeling, especially with long inputs. However, they face two main challenges: 1) SSMs underperform Transformer models such as the Audio Spectrogram Transformer (AST) on short, 10-second audio tagging tasks. 2) Although audio SSMs theoretically support long audio inputs, their actual performance on long audio has not been thoroughly evaluated.
To address these issues, we introduce Knowledge Distilled Audio SSM (DASS), which leverages knowledge distillation during training. DASS is the first SSM to outperform Transformers on AudioSet, achieving an mAP of 48.9 while reducing the model size by one-third. Additionally, we designed a test called Audio Needle In A Haystack (Audio NIAH). DASS, trained on 10-second clips, successfully identifies sound events in hour-long recordings, while AST struggles with inputs of just 50 seconds. This demonstrates SSMs' superior scalability with longer durations.
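
The knowledge distillation described above can be sketched as training a state-space student against both the ground-truth labels and a Transformer teacher's predictions for multi-label tagging. This is a minimal sketch, not the released DASS recipe; the function name, loss formulation, and mixing weight are assumptions.

# Minimal sketch (assumed, not the DASS training code) of knowledge
# distillation for multi-label audio tagging.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, alpha: float = 0.5):
    # AudioSet tagging is multi-label, so binary cross-entropy is applied to
    # both the hard labels and the teacher's soft targets (illustrative mix).
    hard = F.binary_cross_entropy_with_logits(student_logits, labels)
    soft = F.binary_cross_entropy_with_logits(student_logits,
                                              torch.sigmoid(teacher_logits))
    return alpha * hard + (1 - alpha) * soft

# Usage sketch: the student sees the spectrogram, the frozen teacher provides
# soft targets, and the combined loss is backpropagated through the student.
#   student_logits = dass_model(spectrogram)
#   with torch.no_grad():
#       teacher_logits = ast_model(spectrogram)
#   loss = distillation_loss(student_logits, teacher_logits, labels)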
Bio: Saurabhchand Bhati is a postdoctoral researcher at MIT CSAIL. His research interests are unsupervised spoken term discovery, unsupervised representation learning, and multimodal learning. Currently, he is exploring how large language models can improve low-resource speech systems. He received his PhD from Johns Hopkins University in 2023 under the guidance of Dr. Najim Dehak, and a B.Tech and M.Tech in Electrical Engineering from IIT Hyderabad, India.