Automatic speech recognition (ASR) has been a grand challenge machine learning problem for decades. Our ongoing research in this area examines the use of deep learning models for distant and noisy recording conditions, as well as for multilingual and low-resource scenarios.

Unlike humans, automatic speech recognizers are not particularly sensitive to contextual information, nor are they robust to changes in conditions such as recording environment and accent. Our research focuses on developing models that adapt easily to the larger context of their application, whether that is the general topic or state of a conversation or some broader multi-modal context. We explore three directions: contextual grounding methods, adaptation/transfer learning, and neural end-to-end models. Grounding aims to connect the properties of speech that a recognizer predicts to a knowledge base, such as a large collection of text, a list of speakers, a list of topics, a set of noise recordings, or even a set of images. Adaptation deals with changes in that knowledge base, such as recording conditions, speakers, topics, dialects, or even languages, and how the speech recognizer should respond to them. End-to-end training weaves these modules seamlessly together to minimize error propagation and maximize information sharing. Speech is one of the most convenient and efficient means of human communication, and this research will make speech recognizers better suited to be part of the larger conversational systems of the future.
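As a concrete illustration of the end-to-end direction, the sketch below shows a minimal recurrent acoustic encoder trained with a CTC objective, so that acoustic modeling and alignment are optimized jointly through a single gradient rather than as separately trained modules. This is only an illustrative example under assumed settings; the layer sizes, feature dimension, and token vocabulary are placeholders, not a description of our actual systems.

```python
# Minimal end-to-end ASR sketch (illustrative only): a bidirectional LSTM
# encoder trained with CTC, so acoustic modeling and alignment are learned
# jointly. All dimensions and the vocabulary size are placeholder assumptions.
import torch
import torch.nn as nn

class CTCEncoder(nn.Module):
    def __init__(self, n_feats=80, hidden=256, n_tokens=32):
        super().__init__()
        self.rnn = nn.LSTM(n_feats, hidden, num_layers=3,
                           bidirectional=True, batch_first=True)
        self.out = nn.Linear(2 * hidden, n_tokens)  # token logits (incl. blank)

    def forward(self, feats):               # feats: (batch, time, n_feats)
        h, _ = self.rnn(feats)
        return self.out(h).log_softmax(-1)  # (batch, time, n_tokens)

model = CTCEncoder()
ctc = nn.CTCLoss(blank=0)

feats = torch.randn(4, 200, 80)             # a batch of filterbank features
targets = torch.randint(1, 32, (4, 30))     # dummy token label sequences
feat_lens = torch.full((4,), 200)
tgt_lens = torch.full((4,), 30)

log_probs = model(feats).transpose(0, 1)    # CTCLoss expects (time, batch, tokens)
loss = ctc(log_probs, targets, feat_lens, tgt_lens)
loss.backward()                             # one gradient flows through all modules
```

In a grounding or adaptation setting, additional context (a topic embedding, speaker vector, or noise estimate, for example) could be concatenated to the encoder input or output and trained with the same single objective, which is what makes the end-to-end formulation attractive for sharing information across modules.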