James Glass is a Senior Research Scientist at the Massachusetts Institute of Technology, where he leads the Spoken Language Systems Group in the Computer Science and Artificial Intelligence Laboratory. He is also a member of the Harvard University Program in Speech and Hearing Bioscience and Technology. Since obtaining his S.M. and Ph.D. degrees at MIT in Electrical Engineering and Computer Science, his research has focused on automatic speech recognition, unsupervised speech processing, and spoken language understanding using machine learning. He is an IEEE Fellow and a Fellow of the International Speech Communication Association, and is currently an Associate Editor for the IEEE Transactions on Pattern Analysis and Machine Intelligence.
One of the challenges of processing real-world spoken content, such as in automatic speech recognition, is the potential presence of different languages and dialects. Language and dialect identification is a useful capability for determining which language or dialect is being spoken in a recording.
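As a rough illustration of how the task can be framed (not the group's actual system), language identification can be cast as utterance-level classification over acoustic features. The sketch below, assuming librosa and scikit-learn, averages MFCC frames into one fixed-size vector per recording and trains a logistic-regression classifier; the file names and labels are hypothetical placeholders.

```python
# Minimal language-identification sketch: utterance-level classification
# over averaged MFCC features. Illustrative only; file names are invented.
import numpy as np
import librosa
from sklearn.linear_model import LogisticRegression

def utterance_embedding(path, sr=16000, n_mfcc=13):
    """Average MFCC frames into one fixed-size vector per recording."""
    signal, _ = librosa.load(path, sr=sr)
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc)  # (n_mfcc, T)
    return mfcc.mean(axis=1)

# Hypothetical training data: paths to recordings and language labels.
train_paths = ["arabic_01.wav", "english_01.wav", "spanish_01.wav"]
train_labels = ["ara", "eng", "spa"]

X = np.stack([utterance_embedding(p) for p in train_paths])
clf = LogisticRegression(max_iter=1000).fit(X, train_labels)

# Predict the language of a new recording.
print(clf.predict(utterance_embedding("unknown.wav").reshape(1, -1)))
```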
All humans process vast quantities of unannotated speech and manage to learn phonetic inventories, word boundaries, etc., and can use these abilities to acquire new words. Why can't ASR technology have similar capabilities? Our goal in this research project is to build speech technology using unannotated speech corpora.
Automatic speech recognition (ASR) has been a grand challenge machine learning problem for decades. Our ongoing research in this area examines the use of deep learning models for distant and noisy recording conditions, as well as for multilingual and low-resource scenarios.
The Arabic language is spoken by hundreds of millions of people around the world, and it presents a variety of challenges for speech and language processing technologies. In our group, we have several research topics examining Arabic, including dialect identification, speech recognition, machine translation, and language processing.
Our main goal is to automatically search for relevant answers among the many responses provided for a given question (Answer Selection), and to search for relevant questions whose existing answers can be reused (Question Retrieval).
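One minimal way to make the answer-selection task concrete is to rank candidate answers by their textual similarity to the question. The baseline sketch below uses TF-IDF vectors and cosine similarity; the question and candidate answers are invented examples, and actual systems in this line of work rely on far richer, learned representations.

```python
# Baseline answer selection: rank candidate answers for a question by
# cosine similarity of TF-IDF vectors. Texts are hypothetical examples.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

question = "How do I reset my router password?"
candidates = [
    "Hold the reset button for ten seconds, then log in with the defaults.",
    "I had the same problem last week.",
    "The router password can be reset from the admin page at 192.168.1.1.",
]

vectorizer = TfidfVectorizer()
vectors = vectorizer.fit_transform([question] + candidates)
scores = cosine_similarity(vectors[0], vectors[1:]).ravel()

# Highest-scoring candidates are treated as the most relevant answers.
for score, answer in sorted(zip(scores, candidates), reverse=True):
    print(f"{score:.2f}  {answer}")
```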
Our main goal is to develop fact-checking algorithms that can assess the credibility of claims made in textual statements and provide interpretable, valid evidence that explains why a given claim is considered factually true or false.
Generation of sequential data involves multiple factors operating at different temporal scales. Take natural speech, for example: the speaker identity tends to be consistent within an utterance, while the phonetic content changes from frame to frame. By explicitly modeling such a hierarchical generative process under a probabilistic framework, we proposed a model that learns to factorize sequence-level and sub-sequence-level factors into different sets of representations without any supervision.
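To make the hierarchy concrete, the toy sketch below performs ancestral sampling from such a two-level generative process: a sequence-level latent variable (e.g., speaker identity) is drawn once per utterance, while segment-level latent variables (e.g., phonetic content) are drawn independently for each segment. The linear decoder and all dimensions are illustrative assumptions, not the actual neural model.

```python
# Schematic of a two-level hierarchical generative process, written as
# ancestral sampling with toy Gaussian distributions. The linear decoder
# is an illustrative stand-in for the model's neural decoder.
import numpy as np

rng = np.random.default_rng(0)
D_SEQ, D_SEG, D_OBS, N_SEGMENTS = 16, 32, 40, 10

def generate_sequence(W_seq, W_seg):
    # The sequence-level factor (e.g., speaker identity) is sampled once
    # and shared by every segment in the utterance.
    z_seq = rng.normal(size=D_SEQ)
    frames = []
    for _ in range(N_SEGMENTS):
        # Segment-level factors (e.g., phonetic content) are sampled
        # independently per segment, so they vary within the utterance
        # while z_seq stays fixed.
        z_seg = rng.normal(size=D_SEG)
        mean = W_seq @ z_seq + W_seg @ z_seg
        frames.append(rng.normal(loc=mean, scale=0.1))
    return np.stack(frames)

W_seq = rng.normal(size=(D_OBS, D_SEQ))
W_seg = rng.normal(size=(D_OBS, D_SEG))
utterance = generate_sequence(W_seq, W_seg)  # (N_SEGMENTS, D_OBS)
print(utterance.shape)
```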
Our goal is to explore language representations in computational models. We develop new models for representing natural language and investigate how existing models learn language, focusing on neural network models in key tasks like machine translation and speech recognition.
Neural networks, which learn to perform computational tasks by analyzing huge sets of training data, have been responsible for the most impressive recent advances in artificial intelligence, including speech-recognition and automatic-translation systems.
The butt of jokes as little as 10 years ago, automatic speech recognition is now on the verge of becoming people’s chief means of interacting with their principal computing devices. In anticipation of the age of voice-controlled electronics, MIT researchers have built a low-power chip specialized for automatic speech recognition. Whereas a cellphone running speech-recognition software might require about 1 watt of power, the new chip requires between 0.2 and 10 milliwatts, depending on the number of words it has to recognize.
Speech recognition systems, such as those that convert speech to text on cellphones, are generally the result of machine learning. A computer pores through thousands or even millions of audio files and their transcriptions, and learns which acoustic features correspond to which typed words. But transcribing recordings is costly, time-consuming work, which has limited speech recognition to a small subset of languages spoken in wealthy nations.
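A toy version of this supervised pairing, assuming PyTorch, is sketched below: a small network maps acoustic feature frames to character probabilities and is trained with CTC loss against a transcription. The random tensors stand in for real audio features and encoded text.

```python
# Toy supervised ASR training step: acoustic frames -> character
# probabilities, trained with CTC loss. Tensors are random placeholders.
import torch
import torch.nn as nn

T, N, C, FEAT = 100, 1, 29, 40  # frames, batch, characters (incl. blank), feature dim

model = nn.Sequential(nn.Linear(FEAT, 128), nn.ReLU(), nn.Linear(128, C))
ctc = nn.CTCLoss(blank=0)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

features = torch.randn(T, N, FEAT)         # stand-in for MFCC frames
transcript = torch.randint(1, C, (N, 12))  # stand-in for encoded text

log_probs = model(features).log_softmax(dim=-1)  # (T, N, C)
loss = ctc(log_probs, transcript,
           input_lengths=torch.full((N,), T, dtype=torch.long),
           target_lengths=torch.full((N,), 12, dtype=torch.long))
loss.backward()
optimizer.step()
print(float(loss))
```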
For people struggling with obesity, logging calorie counts and other nutritional information at every meal is a proven way to lose weight. The technique does require consistency and accuracy, however, and when it fails, it’s usually because people don’t have the time to find and record all the information they need. A few years ago, a team of nutritionists from Tufts University who had been experimenting with mobile-phone apps for recording caloric intake approached members of the Spoken Language Systems Group at MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) with the idea of a spoken-language application that would make meal logging even easier.
Every language has its own collection of phonemes, or the basic phonetic units from which spoken words are composed. Depending on how you count, English has somewhere between 35 and 45. Knowing a language’s phonemes can make it much easier for automated systems to learn to interpret speech. In the 2015 volume of Transactions of the Association for Computational Linguistics, CSAIL researchers describe a new machine-learning system that, like several systems before it, can learn to distinguish spoken words. But unlike its predecessors, it can also learn to distinguish lower-level phonetic units, such as syllables and phonemes.
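As a simplified stand-in for this kind of unsupervised unit discovery (a classical baseline, not the TACL system itself), one can cluster acoustic frames so that recurring clusters act as phoneme-like units, as sketched below; the audio file and cluster count are assumptions.

```python
# Baseline unsupervised unit discovery: cluster MFCC frames so that
# recurring clusters play the role of phoneme-like units. The audio
# file name is hypothetical.
import librosa
from sklearn.cluster import KMeans

signal, sr = librosa.load("speech.wav", sr=16000)
mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=13).T  # (frames, 13)

# With ~40 clusters (roughly the size of an English phoneme inventory),
# each frame receives a discrete pseudo-phone label.
kmeans = KMeans(n_clusters=40, n_init=10).fit(mfcc)
units = kmeans.labels_

# Collapse consecutive repeats to obtain a phone-like symbol sequence.
sequence = [u for i, u in enumerate(units) if i == 0 or u != units[i - 1]]
print(sequence[:20])
```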
CSAIL’s Spoken Language Systems Group has unveiled a new technique for automatically tracking speakers in audio recordings. The technique tackles speaker diarization: computationally determining who is speaking when, and how many speakers are present, in a recording.
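A schematic version of a diarization pipeline, under strong simplifying assumptions, is sketched below: short windows of audio are embedded and clustered, with each cluster treated as one speaker. Real systems use learned speaker embeddings and much more careful segmentation; here averaged MFCCs, the file name, and the distance threshold are all illustrative.

```python
# Schematic diarization: embed one-second windows and cluster them,
# estimating the number of speakers from a distance threshold.
# The file name and threshold are illustrative assumptions.
import numpy as np
import librosa
from sklearn.cluster import AgglomerativeClustering

signal, sr = librosa.load("meeting.wav", sr=16000)
win = sr  # one-second analysis windows

embeddings = []
for start in range(0, len(signal) - win, win):
    window = signal[start:start + win]
    mfcc = librosa.feature.mfcc(y=window, sr=sr, n_mfcc=20)
    embeddings.append(mfcc.mean(axis=1))  # crude per-window embedding

# With n_clusters=None and a distance threshold, the number of
# speakers is estimated from the data rather than fixed in advance.
labels = AgglomerativeClustering(
    n_clusters=None, distance_threshold=50.0
).fit_predict(np.stack(embeddings))

for i, spk in enumerate(labels):
    print(f"{i}s-{i + 1}s: speaker {spk}")
```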