Thesis defense: Automated interpretation of machine learning models
Evan Hernandez
MIT
As machine learning (ML) models are increasingly deployed in production, there’s
a pressing need to ensure their reliability through auditing, debugging, and testing.
Interpretability, the subfield that studies how ML models make decisions, aspires
to meet this need but traditionally relies on human-led experimentation or on
on human priors about what the model has learned. In this thesis, I propose that
interpretability should evolve alongside ML by adopting automated techniques that
use ML models to interpret ML models. This shift towards automation allows for
more comprehensive analyses of ML models without requiring human scrutiny at
every step, and the effectiveness of these methods should improve as the ML models
themselves become more sophisticated. I present three examples of automated
interpretability approaches: using a captioning model to label the features of other
models, manipulating an ML model's internal representations to predict and correct
errors, and identifying simple internal circuits by approximating the ML model
itself. These examples lay the groundwork for future efforts in automating ML model
interpretation.
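
To make the second approach concrete, here is a minimal, hypothetical sketch of one way to manipulate a model's internal representations: a PyTorch forward hook that shifts a hidden layer's activations along a chosen direction before they reach the rest of the network. The toy model, the random direction, and the scale factor are all illustrative assumptions, not the specific method developed in the thesis.

import torch
import torch.nn as nn

# Toy classifier standing in for a larger ML model (illustrative only).
model = nn.Sequential(
    nn.Linear(16, 32),   # layer whose hidden representation we will edit
    nn.ReLU(),
    nn.Linear(32, 2),
)

# A direction in hidden space along which to shift representations.
# In practice this would come from interpretability analysis; here it is random.
direction = torch.randn(32)

def edit_hidden(module, inputs, output):
    # Returning a tensor from a forward hook replaces the layer's output,
    # so downstream computation sees the edited representation.
    return output + 2.0 * direction

x = torch.randn(1, 16)
print("original logits:", model(x))

# Temporarily intervene on the first layer's output, then restore the model.
handle = model[0].register_forward_hook(edit_hidden)
print("edited logits:  ", model(x))
handle.remove()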