Thesis defense: Automated interpretation of machine learning models
Evan Hernandez
MIT
As machine learning (ML) models are increasingly deployed in production, there’s
a pressing need to ensure their reliability through auditing, debugging, and testing.
Interpretability, the subfield that studies how ML models make decisions, aspires
to meet this need but traditionally relies on human-led experimentation or on
on human priors about what the model has learned. In this thesis, I propose that
interpretability should evolve alongside ML by adopting automated techniques that
use ML models to interpret ML models. This shift towards automation allows for
more comprehensive analyses of ML models without requiring human scrutiny at
every step, and the effectiveness of these methods should improve as the ML models
themselves become more sophisticated. I present three examples of automated
interpretability approaches: using a captioning model to label the features of other
models, manipulating an ML model's internal representations to predict and correct
errors, and identifying simple internal circuits by approximating the ML model
itself. These examples lay the groundwork for future efforts in automating ML model
interpretation.
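
To make the second approach concrete, here is a minimal, hypothetical sketch of one way to manipulate a model's internal representations: a PyTorch forward hook that shifts a hidden layer's activations along a chosen direction before they reach the rest of the network. The toy model, the random direction, and the scale factor are all illustrative assumptions, not the specific method developed in the thesis.

import torch
import torch.nn as nn

# Toy classifier standing in for a larger ML model (illustrative only).
model = nn.Sequential(
    nn.Linear(16, 32),   # layer whose hidden representation we will edit
    nn.ReLU(),
    nn.Linear(32, 2),
)

# A direction in hidden space along which to shift representations.
# In practice this would come from interpretability analysis; here it is random.
direction = torch.randn(32)

def edit_hidden(module, inputs, output):
    # Returning a tensor from a forward hook replaces the layer's output,
    # so downstream computation sees the edited representation.
    return output + 2.0 * direction

x = torch.randn(1, 16)
print("original logits:", model(x))

# Temporarily intervene on the first layer's output, then restore the model.
handle = model[0].register_forward_hook(edit_hidden)
print("edited logits:  ", model(x))
handle.remove()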