As infants, we acquire spoken language by listening to those around us, almost as if by osmosis. Of course, we aren't just listening, but looking as well; what we expect to hear depends on what we see, and what we expect to see is influenced by what we hear. We are building unsupervised machine learning models that not only learn to recognize words within speech signals and objects within visual images, but also learn to associate these patterns with one another. Using these models, we can perform tasks such as semantic image retrieval with spoken queries, and uncover translations of words across languages by using the visual space as an interlingua.
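
As a rough sketch of the shared-embedding idea behind this work (illustrative only, not our actual models), the example below pairs a small audio encoder with a small image encoder and trains them with a contrastive loss so that matching speech/image pairs land near each other in a joint space; semantic image retrieval with a spoken query then reduces to a nearest-neighbor search in that space. All architectures, layer sizes, and names here are assumptions made for the sketch.

```python
# Minimal sketch of a cross-modal embedding model: an audio encoder and an
# image encoder map spoken captions and images into one shared space, trained
# so that matching pairs score higher than mismatched ones. Everything below
# (layer sizes, names) is an illustrative assumption, not our actual system.

import torch
import torch.nn as nn
import torch.nn.functional as F


class AudioEncoder(nn.Module):
    """Embeds a spectrogram (batch, 1, n_mels, frames) into a unit vector."""
    def __init__(self, embed_dim=512):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),  # pool over time and frequency
        )
        self.proj = nn.Linear(128, embed_dim)

    def forward(self, spec):
        h = self.conv(spec).flatten(1)
        return F.normalize(self.proj(h), dim=-1)


class ImageEncoder(nn.Module):
    """Embeds an RGB image (batch, 3, H, W) into the same shared space."""
    def __init__(self, embed_dim=512):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.proj = nn.Linear(128, embed_dim)

    def forward(self, img):
        h = self.conv(img).flatten(1)
        return F.normalize(self.proj(h), dim=-1)


def contrastive_loss(audio_emb, image_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of matched (audio, image) pairs."""
    logits = audio_emb @ image_emb.t() / temperature
    targets = torch.arange(len(logits), device=logits.device)
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2


# Retrieval: rank a gallery of images by similarity to a spoken query.
if __name__ == "__main__":
    audio_enc, image_enc = AudioEncoder(), ImageEncoder()
    spoken_query = torch.randn(1, 1, 40, 200)   # stand-in spectrogram
    gallery = torch.randn(16, 3, 224, 224)      # stand-in images
    scores = audio_enc(spoken_query) @ image_enc(gallery).t()
    print(scores.argsort(descending=True))      # best-matching images first
```

Embedding both modalities in a single space is also what enables the visual interlingua: speech in two different languages describing the same image is pulled toward the same region of that space, so word translations can surface without any parallel text.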
If you would like to contact us about our work, please scroll down to the People section below, where you can reach out to one of the group leads directly through their personal pages.