Interpretability in complex machine learning models
Our goal is to develop methods that can "explain" the behavior of complex machine learning models without restricting their power. We seek explanations that are simple, robust, and grounded in statistical analysis of the model's behavior.
Modern machine learning models are highly flexible but lack transparency. Can we devise methods to explain the predictions of such models without restricting their expressiveness? Can we do so even if we know nothing about their architecture, i.e., if they are "black boxes"? In this project, we are developing methods that explain the predictions a model makes rather than constraining the model itself to be interpretable. We are particularly interested in providing explanations for the predictions of complex machine learning models that operate on structured data, such as sentences, trees, or graphs. For example, we use statistical input-output analysis to learn to interpret predictions of sequence-to-sequence models, such as those used in machine translation and dialogue systems, as sketched below.
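To illustrate the flavor of such input-output analysis (a minimal sketch under simplifying assumptions, not the project's actual method), one simple perturbation scheme deletes each input token in turn, queries the black-box model, and scores the token by how much the output changes. The names black_box, output_change, and token_influence below are placeholders invented for this sketch; black_box stands in for any opaque sequence-to-sequence model we can only query.

    # Minimal sketch of perturbation-based input-output analysis for a
    # black-box sequence model. Everything here is illustrative, not the
    # project's method.
    from collections import Counter

    def black_box(tokens):
        # Placeholder for an opaque sequence-to-sequence model we can only
        # query. For illustration it simply reverses the input; a real model
        # would translate, summarize, etc.
        return list(reversed(tokens))

    def output_change(reference, perturbed):
        # Crude dissimilarity between two output sequences: 1 minus the
        # fraction of reference tokens that survive in the perturbed output.
        ref, per = Counter(reference), Counter(perturbed)
        overlap = sum((ref & per).values())
        return 1.0 - overlap / max(len(reference), 1)

    def token_influence(tokens):
        # Estimate each input token's influence by deleting it and measuring
        # how much the black-box output changes relative to the unperturbed run.
        reference = black_box(tokens)
        scores = {}
        for i, tok in enumerate(tokens):
            perturbed_input = tokens[:i] + tokens[i + 1:]
            scores[tok] = output_change(reference, black_box(perturbed_input))
        return scores

    if __name__ == "__main__":
        sentence = "the model translates this sentence".split()
        print(token_influence(sentence))

Tokens whose deletion changes the output most receive the highest scores, giving a rough, model-agnostic picture of which parts of the input drive the prediction; richer statistical analyses build on the same query-and-compare idea.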