Our goal is to better understand adversarial examples by 1) bounding the minimum perturbation that needs to be added to a regular input example to cause a given neural network to misclassify it, and 2) generating an adversarial input example that achieves this minimum perturbation.
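To make the quantity in 1) concrete, here is a toy sketch: for a hand-picked two-class linear "network" and an L-infinity perturbation budget (both assumptions made for this illustration, not the paper's setup), we bisect on the smallest radius eps such that some point within eps of the input is misclassified.

```python
import itertools

def classify(x):
    # A tiny hand-made decision function: class 1 iff 2*x0 - x1 + 0.5 > 0.
    return 1 if 2 * x[0] - x[1] + 0.5 > 0 else 0

x0 = (1.0, 0.0)          # correctly classified as class 1
label = classify(x0)

def misclassified_within(eps):
    # For a linear decision function over an L-infinity ball, the extreme
    # values are attained at the corners, so checking corners suffices.
    return any(
        classify((x0[0] + s0 * eps, x0[1] + s1 * eps)) != label
        for s0, s1 in itertools.product((-1.0, 1.0), repeat=2)
    )

# Bisection for the minimal adversarial radius: the decision value at x0
# is 2.5, and the worst-case corner shifts it by -3 * eps, so the true
# minimum is 2.5 / 3.
lo, hi = 0.0, 10.0
for _ in range(60):
    mid = (lo + hi) / 2
    if misclassified_within(mid):
        hi = mid
    else:
        lo = mid
print(round(hi, 4))  # approx. 0.8333
```

This brute-force search only works because the toy model is linear and two-dimensional; the abstract's point is that for real networks such a certificate requires the mixed-integer formulation described below.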

Neural networks have been shown to misclassify, with high confidence, examples only slightly different from correctly classified ones. These misclassified examples are known as adversarial examples, and they can serve to expose flaws in the way that the neural network architecture conceptualizes its decision function. Previous work on finding adversarial examples has relied on stochastic search to find an example close to the original, but it does not allow us to determine the distance to the closest adversarial example. We capitalize on the piecewise-linear nature of popular activation functions, such as the ReLU, to express the problem of finding the closest adversarial example as a mixed-integer program. The advantage of our approach is that we are able to provide formal guarantees about the robustness of a network. The magnitude of the minimum perturbation could also be integrated into the cost function during training to improve the robustness of the neural network to adversarial examples.
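The piecewise-linear structure mentioned above is what allows each unit to enter a mixed-integer program. A common way to do this, shown below as an illustration, is the "big-M" encoding of a single ReLU, y = max(0, x), using one binary variable; the bound M and the grid check are assumptions made for this sketch (a real formulation would use a solver and per-unit bounds derived from the input domain).

```python
# Big-M mixed-integer encoding of one ReLU unit y = max(0, x).
# The binary variable z selects the active (z = 1) or inactive (z = 0)
# piece of the piecewise-linear function; M bounds |x| on the domain.

TOL = 1e-9  # numerical slack for the inequality checks

def relu(x):
    return max(0.0, x)

def feasible(x, y, z, M=10.0):
    """Check the four big-M constraints for y = ReLU(x), z in {0, 1}."""
    return (y >= -TOL
            and y >= x - TOL
            and y <= x + M * (1 - z) + TOL
            and y <= M * z + TOL)

# For any fixed x in [-M, M], the only y feasible for some z in {0, 1}
# is y = ReLU(x): the integer program represents the unit exactly.
for x in (-3.0, -0.5, 0.0, 0.7, 2.5):
    sols = [y / 100 for y in range(-1000, 1001)
            if any(feasible(x, y / 100, z) for z in (0, 1))]
    assert sols and all(abs(y - relu(x)) < 1e-6 for y in sols)
```

Stacking one such encoding per unit turns the whole network's input-output map into linear constraints plus binaries, so minimizing the perturbation norm over these constraints yields the closest adversarial example, which is the formulation the abstract describes.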
