All humans process vast quantities of unannotated speech, manage to learn phonetic inventories and word boundaries, and can use these abilities to acquire new words. Why can't ASR technology have similar capabilities? Our goal in this research project is to build speech technology using unannotated speech corpora.

Automatic speech recognition technology is all around us, from the mobile device in your pocket to the smart speaker in your living room. While different devices may rely on different speech recognition algorithms or models, one aspect they all have in common is their reliance on human-annotated training data. Thousands of hours of speech need to be manually transcribed at the word level to achieve state-of-the-art performance. This kind of training data is so expensive to collect that it exists for only a few dozen of the thousands of languages spoken worldwide.

Unsupervised learning algorithms have the potential to reduce this reliance on manually annotated data and democratize speech recognition technology across the world. In the process, they may also shed light on the mechanisms by which humans acquire spoken language simply by observing the world around them. This project focuses on developing novel unsupervised models and algorithms for processing speech, from discovering phonemes and words to higher-level semantics and understanding.
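To give a flavor of what "discovering phonemes" without annotation can mean, here is a toy sketch: clustering acoustic feature frames into discrete units with k-means, so that recurring sound classes emerge from the data alone. The feature vectors, the number of units `k`, and the tiny synthetic dataset below are all illustrative assumptions, not part of this project's actual models; real systems operate on spectral features from recorded speech and use far more sophisticated methods.

```python
# Toy unsupervised "acoustic unit" discovery via k-means clustering.
# Frames, k, and the synthetic data are illustrative assumptions only.

def kmeans(frames, k, iters=50):
    """Cluster feature frames into k discrete units without any labels."""
    # Deterministic init: pick k evenly spaced frames as starting centroids.
    step = len(frames) // k
    centroids = [frames[i * step] for i in range(k)]
    assign = [0] * len(frames)
    for _ in range(iters):
        # Assignment step: each frame joins its nearest centroid.
        for i, f in enumerate(frames):
            assign[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(f, centroids[c])),
            )
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [frames[i] for i in range(len(frames)) if assign[i] == c]
            if members:
                centroids[c] = tuple(sum(d) / len(members) for d in zip(*members))
    return assign, centroids

# Synthetic 2-D "frames": two well-separated groups standing in for two phone classes.
frames = [(0.0 + 0.1 * i, 0.0) for i in range(5)] + \
         [(5.0 + 0.1 * i, 5.0) for i in range(5)]
assign, cents = kmeans(frames, k=2)
print(assign)  # frames from the same group receive the same unit label
```

With no transcriptions at all, the two synthetic sound classes end up with distinct unit labels, which is the basic intuition behind unsupervised unit discovery; segmenting those unit sequences into word-like chunks is the next, much harder step.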