Building models that learn spoken language by seeing and hearing
As infants, we acquire spoken language by listening to those around us, almost as if by osmosis. Of course, we aren't just listening, but looking as well; what we expect to hear depends on what we see, and what we expect to see is influenced by what we hear. We are building unsupervised machine learning models that not only learn to recognize words within speech signals and objects within visual images, but also learn to associate these patterns with one another. With these models, we can perform tasks such as semantic image retrieval from spoken queries, and we can uncover translations of words across languages by using the visual space as an interlingua.
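The retrieval task described above can be sketched as a shared embedding space: an audio encoder and an image encoder each map their input to the same vector space, and a spoken query retrieves the images whose embeddings lie closest to it. The sketch below is purely illustrative and makes heavy assumptions: the encoders are stubbed out as random linear projections (the real models are neural networks trained on paired speech and images), and the names, dimensions, and `retrieve` helper are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def l2_normalize(x, axis=-1):
    # Scale each vector to unit length so dot products become cosine similarities.
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

# Hypothetical stand-ins for trained encoders: random linear projections into a
# shared 32-dimensional space. In practice these would be deep networks over
# spectrograms (audio) and pixels (images), trained jointly.
D_AUDIO, D_IMAGE, D_SHARED = 64, 128, 32
W_audio = rng.standard_normal((D_AUDIO, D_SHARED))
W_image = rng.standard_normal((D_IMAGE, D_SHARED))

def embed_audio(a):
    return l2_normalize(a @ W_audio)

def embed_image(v):
    return l2_normalize(v @ W_image)

def retrieve(query_audio, image_bank, k=3):
    # Rank images by cosine similarity to the spoken query in the shared space.
    sims = embed_image(image_bank) @ embed_audio(query_audio)
    return np.argsort(-sims)[:k]

# Toy data: features for 5 candidate images and one spoken query.
images = rng.standard_normal((5, D_IMAGE))
query = rng.standard_normal(D_AUDIO)
print(retrieve(query, images))  # indices of the top-3 images for this query
```

The same shared space suggests how the visual interlingua works: if speech in two different languages is embedded near the same images, words from the two languages that land near each other can be treated as candidate translations.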