Building Adversarially Robust Natural Language Processing Systems


Robin Jia
Stanford University


Regina Barzilay
While modern NLP systems have achieved outstanding performance on static benchmarks, they often fail catastrophically when presented with inputs that have been adversarially perturbed. This lack of robustness exposes troubling gaps in current models’ understanding capabilities. Moreover, vulnerability to adversarial perturbations poses challenges for deployment of NLP systems for tasks like hate speech detection.

In this talk, I will discuss methods for building models that are provably robust to certain forms of adversarial perturbations. First, I will show how to train NLP models to produce certificates of robustness: guarantees that for a given example and a combinatorially large class of possible perturbations, no perturbation can cause a misclassification. Second, I will describe task-agnostic robust encodings (TARE), a method to construct discrete, task-general sentence representations that confer robustness to any downstream model that uses them. These approaches have complementary strengths: certificates scale well to large inputs, while TARE accommodates arbitrary model architectures and works for many tasks at once. Importantly, both approaches underscore how optimizing for robustness leads to new trade-offs when building NLP models.

Robin Jia is a sixth-year PhD student at Stanford University, advised by Percy Liang. His research interests lie broadly in building natural language processing systems that are robust when given unexpected test-time inputs. His work has received an Outstanding Paper Award at EMNLP 2017 and a Best Short Paper Award at ACL 2018. Robin has been supported by an NSF Graduate Research Fellowship.