Moments is a large-scale, human-annotated dataset of ~1 million (and growing) labeled videos of real-world actions, motions, and events unfolding within three seconds.

Modeling the spatial-audio-temporal dynamics of even atomic actions in three-second videos poses daunting challenges: many meaningful events involve not only people but also objects, animals, and nature; visual and auditory events can be symmetric or asymmetric in time ("opening" played in reverse resembles "closing"), and transient or sustained. This project requires artificial systems to jointly learn three modalities, spatial, temporal, and auditory, in order to recognize activities at a human level, predict future activities and sequences of actions, and understand causal relationships between actions and agents. Moments, designed for broad coverage and diversity of events in both the visual and auditory modalities, can serve as a new challenge for developing models that scale to the level of complexity and abstract reasoning that humans process on a daily basis.
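To make the temporal-symmetry point concrete, here is a minimal sketch of reversing a clip along its time axis; the shapes, frame rate, and sample rate below are illustrative assumptions, not the dataset's actual format:

```python
import numpy as np

# Hypothetical 3-second clip: 90 frames (30 fps) of 64x64 RGB video
# plus a mono audio track sampled at 16 kHz. These values are
# illustrative, not the dataset's actual specification.
frames = np.random.rand(90, 64, 64, 3)   # (time, height, width, channels)
audio = np.random.rand(3 * 16000)        # (samples,)

# Temporal symmetry: an "opening" clip played backwards resembles "closing",
# so reversing the time axis yields a plausible clip of the inverse action.
reversed_frames = frames[::-1]
reversed_audio = audio[::-1]

# Reversal preserves shape, and reversing twice recovers the original clip.
assert reversed_frames.shape == frames.shape
assert np.array_equal(reversed_frames[::-1], frames)
```

A model that treats time-reversed pairs like "opening"/"closing" as distinct classes must learn temporal direction, not just static appearance, which is part of what makes the three-second setting challenging.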