We aim to create a virtual environment where agents learn to perform human tasks by executing programs. Furthermore, we aim to develop models that can generate such programs from video or text, enabling agents to understand and imitate such activities.

In this project, we are interested in modeling complex activities that occur in a typical household. We propose to use programs, i.e., sequences of atomic actions and interactions, as a high-level representation of complex tasks. Programs are interesting because they provide an unambiguous representation of a task and allow agents to execute it. However, there is currently no database providing this type of information. Towards this goal, we crowd-source programs for a variety of activities that happen in people’s homes, via a game-like interface used for teaching kids how to code. Using the collected dataset, we show how we can learn to extract programs directly from natural language descriptions or from videos. We then implement the most common atomic (inter)actions in the Unity3D game engine, and use our programs to “drive” an artificial agent to execute tasks in a simulated household environment. Our VirtualHome simulator allows us to create a large activity video dataset with rich ground-truth, enabling training and testing of video understanding models.
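
To make the idea of a program as a sequence of atomic (inter)actions concrete, here is a minimal sketch in Python. It is only an illustration under our own assumptions: the `Step` class, the `format_program` helper, and the action and object names are hypothetical and do not reflect the actual program syntax or API of the VirtualHome simulator.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Step:
    """One atomic (inter)action. Hypothetical structure, not the simulator's format."""
    action: str                   # e.g. "walk", "grab" (illustrative names)
    target: Optional[str] = None  # object or room acted upon, if any


def format_program(steps: List[Step]) -> str:
    """Render a program as a numbered, human-readable list of steps."""
    lines = []
    for i, step in enumerate(steps, start=1):
        if step.target is None:
            lines.append(f"{i}. {step.action}")
        else:
            lines.append(f"{i}. {step.action} {step.target}")
    return "\n".join(lines)


# An illustrative program for the activity "watch TV" (made-up example).
watch_tv = [
    Step("walk", "living_room"),
    Step("find", "remote"),
    Step("grab", "remote"),
    Step("switch_on", "tv"),
    Step("sit", "sofa"),
    Step("watch", "tv"),
]

print(format_program(watch_tv))
```

Because each step names exactly one action and at most one target, a sequence like this leaves no room for interpretation about order or objects, which is the property that lets an agent execute it step by step.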

Sanja Fidler

Kevin Ra

Marko Boben

Tingwu Wang

Jiaman Li

Shantanu Jain