THESIS DEFENSE: Language Modeling from Visually Grounded Speech

Speaker

Cheng-I (Jeff) Lai
MIT CSAIL

Host

Jim Glass
MIT CSAIL

Abstract:

Recent advances in spoken language processing have significantly reduced automatic speech recognition (ASR) error rates, driven by large-scale supervised training on paired speech–text data and, more recently, self-supervised pre-training on unpaired speech and audio. These methods have enabled robust transfer learning across diverse speech and audio tasks. However, fully leveraging multimodal inputs, particularly visual context, remains underexplored. This thesis addresses that gap by developing novel language modeling techniques learned directly from visually grounded speech.

We first introduce the Audio-Visual Neural Syntax Learner (AV-NSL), an unsupervised parser that recovers constituency trees directly from raw speech paired with images, demonstrating how visual context effectively bootstraps grammar induction without textual supervision. Next, we investigate Audio-Visual Word Discovery for Speech Translation, using the Fisher Spanish–English corpus to train a series of speech-to-speech translation models based on pseudo-word units discovered via audio-visual grounding. This study shows that simplistic acoustic tokens and limited training data degrade re-synthesis and translation quality, underscoring two crucial missing ingredients: richer semantic tokens and large-scale training. Guided by these insights, we present Audio-Visual Gemma (AV-Gemma), a family of multimodal foundation models that condition jointly on images and learned semantic speech tokens. At scale, AV-Gemma generates visually coherent spoken captions and transfers robustly to tasks such as video-to-speech generation and spoken visual question answering, significantly advancing multimodal spoken language processing.

Advisor:  Jim Glass

Thesis Committee:  Jacob Andreas, Yoon Kim