We work towards a principled understanding of why deep learning classifiers are not robust and develop approaches for training reliably robust classifiers.

Deep neural networks have been pivotal in a number of recent breakthroughs in computer vision, language translation, and other machine learning tasks. The outstanding accuracy they achieve on benchmarks suggests that they perform these tasks at a truly human level. The truth, however, turns out to be more nuanced. A particularly striking phenomenon is the non-robust nature of such networks: it is often possible to perturb a correctly classified input in a way that is imperceptible to humans yet causes the network to completely misclassify it. Such “adversarial” inputs not only indicate that neural networks don’t learn as well as we would hope but also pose a serious security problem for their real-world deployment.

Our project aims to develop a principled understanding of this phenomenon and ways to mitigate it. Specifically, we use the language of continuous optimization to precisely capture the notion of an adversarial input. This language enables us to state a formal security guarantee we hope to satisfy, and to let this guarantee drive the classifier training process (a concrete formulation is sketched below). While the resulting optimization problems are often theoretically intractable, insights from classic optimization theory as well as extensive empirical examination suggest that they can be reliably solved in practice.

Applying these techniques and insights has already enabled us to train robust classifiers for some of the standard computer vision datasets, and the resulting models appear to be secure against a wide range of popular attacks. We are now working towards achieving the same level of robustness on more complex datasets, as well as developing efficient algorithms for training such networks.
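To make the optimization viewpoint concrete: a common way to formalize this kind of security guarantee is as a saddle-point (min-max) problem over the model parameters θ and a set S of allowed perturbations (for instance, an ℓ∞-ball of radius ε around the input). The exact formulation used in the project may differ in its details; a standard version, for a loss L and data distribution D, reads

\[
\min_{\theta} \; \mathbb{E}_{(x,y)\sim \mathcal{D}} \Big[ \max_{\delta \in S} \, L(\theta, x + \delta, y) \Big].
\]

The inner maximization captures the strongest adversarial perturbation of a given input, while the outer minimization trains the parameters to perform well even against that worst case.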
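The inner maximization is the part that is intractable in the worst case but often solvable in practice; a widely used approximation is projected gradient descent (PGD) on the input. The sketch below is only illustrative — the PyTorch setup, the ℓ∞ constraint, and the hyperparameters (eps, step_size, num_steps) are assumptions for the example, not the project's actual implementation.

```python
import torch

def pgd_attack(model, loss_fn, x, y, eps=8/255, step_size=2/255, num_steps=10):
    """Approximately solve the inner maximization: find a perturbation delta
    with ||delta||_inf <= eps that (approximately) maximizes the loss.
    Illustrative sketch only; inputs x are assumed to lie in [0, 1]."""
    # Start from a random point inside the l_inf ball (random restart).
    delta = torch.empty_like(x).uniform_(-eps, eps)
    delta.requires_grad_(True)

    for _ in range(num_steps):
        loss = loss_fn(model(x + delta), y)
        grad, = torch.autograd.grad(loss, delta)
        with torch.no_grad():
            # Ascend the loss along the gradient sign (l_inf geometry),
            # then project back onto the eps-ball and the valid pixel range.
            delta += step_size * grad.sign()
            delta.clamp_(-eps, eps)
            delta.copy_((x + delta).clamp(0, 1) - x)

    return (x + delta).detach()
```

In adversarial training along these lines, each minibatch would first be passed through such an attack, and the outer minimization would then take an ordinary gradient step on the resulting perturbed examples.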