Thesis Defense, Sung Min (Sam) Park: ML Predictions through the Lens of Data

Speaker

Sung Min (Sam) Park

Host

Committee: Aleksander Madry (Advisor), Ankur Moitra, Antonio Torralba
Abstract: Data plays a critical role in driving the behavior of machine learning models. Yet, understanding precisely how the choice of the training data influences model predictions remains challenging.

First, I’ll introduce the datamodeling framework for directly modeling predictions as functions of training data. Despite the complexity of the underlying process (e.g., SGD on deep neural networks), we show that we can accurately predict final model outputs as linear functions of the presence of different training examples.

Second, I’ll present TRAK, a much faster method for estimating datamodels based on ideas from influence functions and the empirical NTK. TRAK allows us to reliably scale data attribution to large-scale settings for the first time and generalizes well across various domains (e.g., vision, language modeling, diffusion).

Finally, I’ll motivate the problem of unlearning (updating a machine learning model to “forget” a part of its training data) and show how to tackle this challenging problem by leveraging datamodels.