THESIS DEFENSE: Towards a Unified Framework for Visual Recognition and Generation via Masked Generative Modeling


Tianhong Li


Dina Katabi
Recognition and generation are two key tasks in computer vision. However, recognition and generation models are typically trained independently, which ignores the complementary nature of the two tasks.
In this thesis, we present a unified framework for visual recognition and generation via masked generative modeling, and demonstrate its effectiveness in addressing challenges across a range of applications.
We begin with MAGE, a novel framework that unifies image generation and recognition while achieving state-of-the-art performance on both tasks. We then extend it to vision-language multi-modal training through ITIT, which uses unpaired image and text data to train models capable of high-quality, bidirectional image-text generation: the recognition power enables accurate image-to-text captioning, while the generation power enables realistic text-to-image generation. Moreover, inspired by the synergy between image generation and recognition observed in MAGE, we introduce RCG, a framework that raises the quality of unconditional image generation to the level of class-conditional generation by using representations learned in a self-supervised manner to guide the generative process. Lastly, we introduce Reparo, which uses masked generative modeling to address packet loss in video conferencing by reconstructing lost video data without traditional error correction methods. This maintains high-quality communication even under substantial data loss. Together, these works demonstrate that the proposed unified framework not only advances the state of the art in individual downstream applications but also provides robust, versatile solutions adaptable to a wide range of real-world problems in computer vision and beyond.

Tianhong Li is a PhD candidate in the EECS Department at MIT, advised by Prof. Dina Katabi. His recent research interests lie in generative models and representation learning, as well as vision-language models. More generally, Tianhong is interested in building a unified and general visual foundation model that can understand the world beyond human perception and intelligence. He was awarded the MathWorks Fellowship in 2023. Previously, he received a BS in Computer Science from the Yao Class at Tsinghua University.

Committee members:
Dina Katabi (thesis supervisor), Mohammad Alizadeh, Phillip Isola, Kaiming He