Developing Domain-Specific Generative Models
Speaker: Kathleen Lewis, CSAIL MIT
Host: John Guttag, CSAIL MIT
Abstract:
As generative AI research shifts toward large-scale foundation models, careful thought must go into adapting these models to new domains and tasks. Our work focuses on this challenge and proposes novel methods for adapting large-scale generative models to domain-specific tasks. We demonstrate these methods on three applications: virtual try-on, conceptual art, and fine-grained object classification. Beyond the technical contributions, my thesis explores broader open questions about domain-specific generative models: How can we carefully construct training data to mitigate bias? What do human-in-the-loop methods for creative generative AI look like in practice? To what extent are large-scale vision-language models useful for traditionally image-only tasks?
This talk will focus on two of our generative methods, TryOnGAN and GIST. In the TryOnGAN work, we present a modified StyleGAN2 architecture and introduce a layered latent space interpolation method for photorealistic virtual try-on. GIST combines existing foundation models in a novel way to generate image-specific fine-grained text descriptions from image-only datasets. We demonstrate the utility of GIST by fine-tuning vision-language models on the image-and-generated-text pairs to learn an aligned vision-language representation space for improved classification. We evaluate our learned representation space in full-shot and few-shot scenarios across four diverse fine-grained classification datasets and demonstrate state-of-the-art classification performance.
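To give a flavor of the layered latent space interpolation idea, the sketch below mixes two per-layer latent codes of a StyleGAN2-style generator with a separate blending coefficient for each layer. It is an illustrative sketch only, not code from the thesis: the layer count, latent dimensionality, and the particular person/garment layer split are assumptions for illustration.

import numpy as np

# Assumed sizes for a StyleGAN2-style W+ latent (not taken from the thesis).
NUM_LAYERS = 18      # per-layer latent codes fed to the generator
LATENT_DIM = 512     # dimensionality of each layer's latent vector

def layered_interpolation(w_person, w_garment, alphas):
    """Blend two per-layer latent codes with one coefficient per layer.

    w_person, w_garment: arrays of shape (NUM_LAYERS, LATENT_DIM)
    alphas: array of shape (NUM_LAYERS,); 0 keeps the person code for
            that layer, 1 takes the garment code for that layer.
    """
    alphas = np.asarray(alphas).reshape(-1, 1)  # broadcast over the latent dimension
    return (1.0 - alphas) * w_person + alphas * w_garment

# Hypothetical usage: keep the coarse (pose/identity) layers from the person
# and swap the mid/fine (clothing appearance) layers toward the garment.
w_person = np.random.randn(NUM_LAYERS, LATENT_DIM)
w_garment = np.random.randn(NUM_LAYERS, LATENT_DIM)
alphas = np.concatenate([np.zeros(6), np.ones(12)])  # illustrative per-layer choice
w_mixed = layered_interpolation(w_person, w_garment, alphas)
print(w_mixed.shape)  # (18, 512)

In practice the per-layer coefficients (and which layers control pose versus clothing) would be chosen or learned for the try-on task; the hard 0/1 split above is only meant to show the mechanism.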
Committee: John Guttag (MIT), Frédo Durand (MIT), Guha Balakrishnan (Rice University), Adrian Dalca (MIT/HMS/MGH)