We aim to create a virtual environment where agents learn to perform human tasks by executing programs. Furthermore, we aim to develop models that can generate such programs from video or text, enabling agents to understand and imitate such activities.