Speech Generation and Sound Understanding in the Era of Large Language Models
David Harwath
University of Texas, Austin
May 1, 2025, 4:00–5:00 PM (Eastern Time)
Abstract: Transformer-based large language models (LLMs) have rapidly risen to dominance in the NLP field. One of the most exciting developments in this line of research is the finding that LLMs can be easily extended to handle multimodal inputs, such as vision or speech, via tokenization and concatenation with natural language inputs. In this talk, I will discuss several of my group's recent research directions aimed at expanding the capabilities of multimodal LLMs to process speech and spatial audio signals. In the first half of my talk, I will present my group's work on VoiceCraft, a neural codec language model that can perform targeted edits of speech recordings, in which words are arbitrarily inserted, deleted, or substituted in the waveform itself. These edits preserve the speaker's voice, prosody, and speaking style, while leaving the non-edited regions of the waveform completely intact. Subjective human evaluations indicate that the naturalness of the edited speech is approximately on par with that of the unedited speech, and when used for voice-cloning TTS, our model outperforms systems such as VALL-E and XTTS-v2. In the second half of my talk, I will discuss our recent work on spatial sound understanding. Sound event localization and detection is a classic task in the speech and audio community, involving predicting the class of a sound source as well as localizing it (e.g., predicting the direction of arrival). We extend this task to encompass higher-level reasoning about multiple sources within a physical environment by proposing the SpatialSoundQA dataset. This dataset contains over 800,000 ambisonic waveforms and accompanying question-answer pairs, and evaluates models on their ability to answer natural language questions such as "Is the sound of the telephone further to the left than the sound of the barking dog?" I will also describe our BAT model, an extension of the LLaMA LLM that is capable of taking spatial audio recordings as input and reasoning about them using natural language.

Bio: David Harwath is an assistant professor in the computer science department at UT Austin, where he leads the Speech, Audio, and Language Technologies (SALT) Lab. His group's research focuses on developing novel machine learning methods applied to speech, audio, and multimodal data for tasks such as automatic speech recognition, text-to-speech synthesis, and acoustic scene analysis. He has received the NSF CAREER award (2023), an ASRU best paper nomination (2015), and the 2018 George M. Sprowls Award for best computer science PhD thesis at MIT. He holds a B.S. in electrical engineering from UIUC (2010), an S.M. in computer science from MIT (2013), and a Ph.D. in computer science from MIT (2018).
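For readers unfamiliar with the "tokenize and concatenate" recipe mentioned in the abstract, the sketch below illustrates the general idea in Python/PyTorch: audio is quantized into discrete codec tokens, those tokens are embedded into the same vector space as text tokens, and the combined sequence is fed to a standard decoder-only LLM. This is not code from the talk or from VoiceCraft/BAT; all names, vocabulary sizes, and dimensions are illustrative assumptions.

```python
# Minimal sketch of the tokenize-and-concatenate idea for multimodal LLMs.
# Hypothetical sizes; real systems use a trained neural codec and a full transformer.
import torch
import torch.nn as nn

TEXT_VOCAB = 32_000   # assumed text vocabulary size
AUDIO_VOCAB = 1_024   # assumed neural-codec codebook size
D_MODEL = 512         # assumed model dimension

text_embed = nn.Embedding(TEXT_VOCAB, D_MODEL)
audio_embed = nn.Embedding(AUDIO_VOCAB, D_MODEL)  # codec tokens get their own table

def build_input(text_ids: torch.Tensor, audio_ids: torch.Tensor) -> torch.Tensor:
    """Embed text and audio token ids, then concatenate along the time axis."""
    text_vecs = text_embed(text_ids)      # (T_text, D_MODEL)
    audio_vecs = audio_embed(audio_ids)   # (T_audio, D_MODEL)
    # The resulting single sequence is what a decoder-only LLM would consume.
    return torch.cat([text_vecs, audio_vecs], dim=0)

# Example: a short "transcript + codec tokens" sequence.
text_ids = torch.randint(0, TEXT_VOCAB, (8,))
audio_ids = torch.randint(0, AUDIO_VOCAB, (20,))
seq = build_input(text_ids, audio_ids)
print(seq.shape)  # torch.Size([28, 512])
```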
TBD