Thesis defense - Automated interpretation of machine learning models

Speaker: MIT

Host: Evan Hernandez, MIT

As machine learning (ML) models are increasingly deployed in production, there is a pressing need to ensure their reliability through auditing, debugging, and testing. Interpretability, the subfield that studies how ML models make decisions, aspires to meet this need but traditionally relies on human-led experimentation or on human priors about what the model has learned. In this thesis, I propose that interpretability should evolve alongside ML by adopting automated techniques that use ML models to interpret ML models. This shift towards automation allows for more comprehensive analyses of ML models without requiring human scrutiny at every step, and the effectiveness of these methods should improve as the ML models themselves become more sophisticated. I present three examples of automated interpretability approaches: using a captioning model to label features of other models, manipulating an ML model's internal representations to predict and correct errors, and identifying simple internal circuits by approximating the ML model itself. These examples lay the groundwork for future efforts in automating ML model interpretation.