Many existing approaches to learning semantic parsers require annotated parse trees or are entirely unsupervised. In contrast, children acquire language by interacting with the world around them. We aim to learn language in a manner more similar to children: through distant supervision from captioned videos. Our model is a joint vision-language system that learns the meanings of words by observing the actions a sentence describes depicted in a video. This approach yields more robust parsing that incorporates perception.