Abstract: Black-box models, such as neural networks, have been increasingly adopted in many tasks. However, their opacity, or the inability to understand their inner workings, has hindered their deployment in high-stakes domains such as healthcare and finance. In this talk, I describe my research in interpretability and transparency to address this issue.
In the interpretability category, I introduce two fundamental properties of good explanations for model predictions: correctness and understandability. Correctness captures the notion that explanations should faithfully represent the model’s decision-making logic, and understandability reflects the requirement that these explanations be reliably understood by human users. For both properties, I propose evaluation metrics and methods that improve upon existing ones, and I identify avenues for future work.
In the transparency category, I present the transparency-by-example framework, a Bayesian sampling formulation for inspecting models and identifying a wide range of model behaviors. I demonstrate the flexibility of this Bayesian approach by applying it to both deep neural networks and non-differentiable robot controllers, revealing hard-to-find insights in both cases.