Human cognition is extremely flexible. We can easily adapt to new tasks, generalize about the world around us, and use perception to learn from our environment. For models of artificial intelligence, the opposite is often true: they cannot generalize to novel tasks, they perform only in constrained environments, and they are often disconnected from perception. I aim to build models grounded in perception that tackle classical AI planning problems in a new realm, using a joint approach that combines techniques from natural language processing and computer vision. By drawing on both language and vision, my work aims to observe agents in videos and generate plans for optimal goal accomplishment.