Sometimes it’s easy to forget how good we humans are at understanding our surroundings. Without much thought, we can describe objects and how they interact with one another.
For machines, however, this is still a huge problem, which makes teaching computers to recognize and predict future actions incredibly difficult.
Recently, researchers from MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) have worked with IBM to move a step closer to tackling this challenge. The team created a large-scale dataset to help AI systems recognize and understand actions and events in videos.
Called “Moments in Time”, the dataset contains over one million labeled three-second videos of animals, people, and objects in dynamic scenes.
While previous datasets have largely focused on specific scenes and actions like “washing hair” or “painting nails”, Moments uses basic actions like “smiling” or “jogging”, focusing on the building blocks necessary for machines to eventually describe more complex events.
For example, a label like “doing a yoga pose” describes only that one particular activity. Basic actions, by contrast, capture its different layers: doing a yoga pose also involves “stretching”, “moving”, or “curling”.
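To make the idea concrete, here is a minimal sketch of how a few complex activities might map onto a shared vocabulary of basic actions. The mapping and the extra activities beyond the yoga example are illustrative assumptions, not part of the actual dataset:

```python
# Hypothetical illustration: visually different activities can share
# the same underlying basic actions.
COMPLEX_TO_BASIC = {
    "doing a yoga pose":    {"stretching", "moving", "curling"},
    "warming up for a run": {"stretching", "jogging", "moving"},   # assumed example
    "playing with a dog":   {"running", "jumping", "smiling"},     # assumed example
}

def shared_basic_actions(activity_a: str, activity_b: str) -> set:
    """Return the basic actions two complex activities have in common."""
    return COMPLEX_TO_BASIC[activity_a] & COMPLEX_TO_BASIC[activity_b]

if __name__ == "__main__":
    # "stretching" and "moving" link yoga to warming up for a run,
    # even though the two scenes look nothing alike on screen.
    print(shared_basic_actions("doing a yoga pose", "warming up for a run"))
```

Because the basic-action vocabulary is reused across many activities, a model that learns it once can, in principle, help describe complex events it has never been explicitly labeled for.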
“For us humans, it’s easy to recognize that all of these are the same action, despite the fact that visually they look quite different from each other,” says Danny Gutfreund, Video Analytics Scientist at IBM Research AI. “The challenge is to train computer models to do the same. One starting point is identifying the spatial-temporal transformation that’s common to all these scenarios as a means of identifying these patterns. Moments will begin to help us with this and other challenges.”
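As a rough sense of what learning such spatio-temporal patterns can look like in code, the sketch below defines a tiny 3D-convolutional classifier whose filters slide over time as well as space, so they respond to motion rather than only static appearance. This is a generic PyTorch example, not the CSAIL/IBM team’s model; the class count, frame count, and resolution are placeholders:

```python
import torch
import torch.nn as nn

class TinyActionClassifier(nn.Module):
    """Minimal 3D-convolutional classifier: 3D kernels convolve over
    (time, height, width), so learned features capture motion patterns."""

    def __init__(self, num_classes: int = 339):  # placeholder class count
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(3, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool3d(2),
            nn.Conv3d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),  # collapse time and space to one value per channel
        )
        self.classifier = nn.Linear(32, num_classes)

    def forward(self, clip: torch.Tensor) -> torch.Tensor:
        # clip shape: (batch, channels, frames, height, width)
        x = self.features(clip).flatten(1)
        return self.classifier(x)

# A fake three-second clip sampled at 8 frames of 112x112 pixels (assumed values).
clip = torch.randn(1, 3, 8, 112, 112)
logits = TinyActionClassifier()(clip)
print(logits.shape)  # torch.Size([1, 339])
```

In practice, short fixed-length clips like the three-second videos in Moments are a convenient unit for this kind of training, since every example carries both appearance and motion in a small, uniform package.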
In the future, the team hopes to make the dataset more diverse, adding clips with multiple events in the same three-second scene as well as more complex tasks.