Spatio-temporal Convolutional Networks Explain Neural Representations of Human Action
Spatio-temporal convolutional networks are a good model of how visual cortex represents the actions of others, and robustness to complex transformations is key to uncovering how human visual cortex is organized.
The ability to recognize the actions of others is a crucial aspect of human perception. We investigate the computational mechanisms by which visual cortex represents human actions. We use a dataset of well-controlled videos of five actions performed by five actors at five viewpoints, together with magnetoencephalography (MEG) recordings of human subjects viewing these videos. Actions can be decoded from these MEG recordings robustly across changes in viewpoint. We explore variations of convolutional neural network architectures and assess the performance of each model on the same viewpoint-invariant action recognition task, as well as how well each model matches the MEG data, using Representational Similarity Analysis. We show that feed-forward spatio-temporal convolutional neural networks (ST-CNNs) perform well on invariant action recognition tasks and account for the majority of the explainable variance in the neural data. Recent advances in comparing computational models and neural recordings have revealed that optimizing artificial systems for performance on discriminative tasks can yield accurate models of visual cortex. We broaden the scope of these methods to include video stimuli. Furthermore, we show that performance on a viewpoint-invariant action recognition task is predictive of how well a model representation matches human neural data, whereas performance on a non-viewpoint-invariant version of the task is not informative. Our results suggest that spatio-temporal convolutional networks are a good model of how visual cortex represents the actions of others, and that robustness to complex transformations is a specific computational goal driving the organization of visual processing in the human brain.
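The model-to-brain comparison hinges on Representational Similarity Analysis: each representation (MEG sensor patterns or CNN layer activations) is summarized as a matrix of pairwise dissimilarities between stimuli, and two representations are compared by correlating these matrices. Below is a minimal sketch in Python; the NumPy/SciPy implementation, the correlation-distance dissimilarity measure, the Spearman comparison, and the array sizes (other than the 125 = 5 actions x 5 actors x 5 viewpoints implied by the dataset) are illustrative assumptions, not details taken from the paper.

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

def rdm(responses):
    """Representational dissimilarity matrix (condensed form):
    pairwise correlation distance between condition patterns.
    `responses` has shape (n_conditions, n_features), e.g. MEG
    sensor patterns or flattened CNN layer activations, one row
    per video stimulus."""
    return pdist(responses, metric="correlation")

def rsa_score(model_responses, neural_responses):
    """Spearman correlation between the model RDM and the neural RDM;
    higher values mean the model's representational geometry better
    matches the neural data."""
    rho, _ = spearmanr(rdm(model_responses), rdm(neural_responses))
    return rho

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Hypothetical sizes: 125 videos (5 actions x 5 actors x 5 views),
    # 306 MEG sensors, 4096 model features. Random data stands in for
    # real recordings and activations.
    neural = rng.standard_normal((125, 306))
    model = rng.standard_normal((125, 4096))
    print(f"RSA score: {rsa_score(model, neural):.3f}")
```

Because the comparison operates on dissimilarity structure rather than raw activations, it allows a CNN layer and an MEG sensor array with entirely different dimensionalities to be scored against one another.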