Unlike humans, automatic speech recognizers are not particularly sensitive to contextual information, and they are not robust to changes in conditions such as recording environments and accents. Our research focuses on developing models that adapt easily to the larger context of their application, whether that is the general topic or state of a conversation, or some larger multi-modal context. We explore three directions: contextual grounding methods, adaptation/transfer learning, and neural end-to-end models. Grounding aims to connect properties of the speech that a recognizer predicts to a knowledge base, such as a large collection of text, a list of speakers, a list of topics, a set of noise recordings, or even a set of images. Adaptation deals with changes in the knowledge base, such as changes in recording conditions, speakers, topics, dialects, or even languages, and with how the speech recognizer should respond to those changes. End-to-end training weaves these modules seamlessly together to minimize error propagation and maximize information sharing. Speech is one of the most convenient and efficient means of communication, and this research will make speech recognizers better suited to be part of the larger conversational systems of the future.
If you would like to contact us about our work, please scroll down to the people section and click through to the page of one of the group leads, where you can reach out to them directly.