David Harwath

Research Scientist

I hold a B.S. in electrical and computer engineering from the University of Illinois at Urbana-Champaign, an S.M. in computer science from MIT, and a Ph.D. in computer science from MIT.

My research focus is broadly in the area of speech and language processing. In the past, I have worked on automatic language identification, latent topic modeling, text summarization, pronunciation modeling for speech recognition, audio-visual speech recognition, and unsupervised speech pattern discovery.

The current focus of my research work is multimodal perception. Human babies possess a unique and incredible ability to learn language by simply being immersed in the world. They learn to utilize spoken language to communicate their feelings, needs, perceptions, and the state of the world to their caretakers and peers. This language is inescapably grounded in the real world, and thus tightly coupled to other sensory modalities such as vision, touch, and smell.

I am interested in designing unsupervised learning algorithms that can acquire language and learn to perceive the world in a similarly organic way, without necessarily mimicking the mechanisms by which humans do so. I believe that the cross-modal correspondences that exist in the real world can be leveraged to guide this learning, acting as a surrogate for the expert annotations upon which conventional machine learning models rely.



Unsupervised Speech Processing

All humans process vast quantities of unannotated speech and manage to learn phonetic inventories, word boundaries, etc., and can use these abilities to acquire new words. Why can't ASR technology have similar capabilities? Our goal in this research project is to build speech technology using unannotated speech corpora.
Jim Glass


Automatic Speech Recognition

Automatic speech recognition (ASR) has been a grand challenge machine learning problem for decades. Our ongoing research in this area examines the use of deep learning models for distant and noisy recording conditions, as well as for multilingual and low-resource scenarios.



Learning words from pictures

Speech recognition systems, such as those that convert speech to text on cellphones, are generally the result of machine learning. A computer pores through thousands or even millions of audio files and their transcriptions, and learns which acoustic features correspond to which typed words. But transcribing recordings is costly, time-consuming work, which has limited speech recognition to a small subset of languages spoken in wealthy nations.