Antonio Torralba is the Delta Electronics Professor of Electrical Engineering and Computer Science at the Massachusetts Institute of Technology (MIT) and head of the AI+D faculty in the EECS department. He received a degree in telecommunications engineering from Telecom BCN, Spain, in 1994 and a Ph.D. degree in signal, image, and speech processing from the Institut National Polytechnique de Grenoble, France, in 2000. From 2000 to 2005, he received postdoctoral training at the Brain and Cognitive Sciences Department and the Computer Science and Artificial Intelligence Laboratory at MIT, where he is now a professor.
We aim to create a virtual environment where agents learn to perform human tasks by executing programs. Furthermore, we aim to develop models that can generate such programs from video or text, enabling agents to understand and imitate such activities.
Our goal is to build a system that predicts where people are looking in images. Given an image and the location of a head, our approach follows the gaze of the person and identifies the object being looked at.
We aim to understand 3D object structure from a single image. We propose an end-to-end framework which sequentially estimates 2D keypoint heatmaps and 3D object structure, by training it on both real 2D-annotated images and synthetic 3D data and by integrating a 3D-to-2D projection layer.
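The 3D-to-2D projection step can be illustrated with a minimal pinhole-camera sketch. This is a simplified, hedged example, not the framework's actual layer: the real projection layer is a differentiable component of the network, and the function name and `focal` parameter here are assumptions for illustration.

```python
import numpy as np

def project_3d_to_2d(points_3d, focal=1.0):
    """Project Nx3 camera-frame 3D points onto the image plane (pinhole model).

    Each point (X, Y, Z) maps to (f*X/Z, f*Y/Z); a consistency loss can then
    compare these projections against predicted 2D keypoint locations.
    """
    X, Y, Z = points_3d[:, 0], points_3d[:, 1], points_3d[:, 2]
    u = focal * X / Z
    v = focal * Y / Z
    return np.stack([u, v], axis=1)

# A point twice as far from the camera projects to coordinates half as large.
pts = np.array([[1.0, 2.0, 2.0],
                [1.0, 2.0, 4.0]])
print(project_3d_to_2d(pts))  # [[0.5  1.  ] [0.25 0.5 ]]
```

Because the projection is a simple differentiable function of the 3D coordinates, gradients from a 2D keypoint loss can flow back to the 3D structure estimate, which is what lets synthetic 3D data and real 2D-annotated images be used in a single training loop.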
Our goal is to understand the illumination of an environment. By disentangling the illumination effect from other intrinsic properties (e.g., geometry, texture, color), we can better understand how humans perceive the world. This also enables applications such as single-image relighting and color editing.
The shared mission of Visual Computing is to connect images and computation, spanning topics such as image and video generation and analysis, photography, human perception, touch, applied geometry, and more.
Speech recognition systems, such as those that convert speech to text on cellphones, are generally the result of machine learning. A computer pores through thousands or even millions of audio files and their transcriptions, and learns which acoustic features correspond to which typed words. But transcribing recordings is costly, time-consuming work, which has limited speech recognition to a small subset of languages spoken in wealthy nations.