To read emotions, understand actions, or anticipate intentions, humans need efficient ways of gathering information about each other. Gaze and speech in particular are rich sources of information about other people's thoughts, and this thesis investigates both. In the first part of the thesis, we describe our work on predicting human gaze and introduce a series of gaze-following methods for different modalities. First, we present GazeFollow, a dataset and model to predict the location of people's gaze in an image. We then extend this method to video, where the system predicts when and where in the video the attended object appears. Finally, we introduce Gaze360, a large-scale gaze-tracking dataset and method for robust 3D gaze direction estimation in unconstrained scenes. To improve processing efficiency, we also propose a saliency-based sampling layer designed to improve performance on arbitrary tasks by zooming into the relevant parts of the input image. In the second part of the thesis, we present our work on learning spoken words from raw audio descriptions of images. We describe a multimodal system capable of learning correspondences between segments of audio (nouns) and specific visual concepts. To investigate how to extend this system beyond learning nouns, we present a novel training procedure to learn abstract visual attributes (i.e., size, material or color) by using a generative model to generate the training images. Building upon recent findings that GAN representations can be manipulated to edit semantic concepts in the generated output, our method trains the model on GAN-generated images with a triplet loss.
Adrià Recasens is a PhD student in computer vision at the Computer Science and Artificial Intelligence Laboratory (CSAIL) of the Massachusetts Institute of Technology, advised by Professor Antonio Torralba. His research interests span various topics in computer vision and machine learning. Among other things, he works on gaze estimation: predicting where people are looking in images. Before starting his PhD, he completed a double degree in Mathematics and Telecommunications at the Centre de Formació Interdisciplinària Superior of the Polytechnic University of Catalonia, BarcelonaTech. While finishing his double degree, he collaborated with the Mobile Experience Laboratory at MIT and the LARCA group at UPC BarcelonaTech.