Many existing approaches to learning semantic parsers require annotated parse trees or are entirely unsupervised. In contrast, children acquire language by interacting with the world around them. We aim to learn language in a manner more similar to children: through distant supervision from captioned videos. Our model is a joint vision-language system that learns the meaning of words by observing the actions described in a sentence as they are depicted in a video. This work enables more robust parsing that incorporates perception.
If you would like to contact us about our work, please scroll down to the people section and open one of the group leads' pages, where you can reach out to them directly.