Beyond Empirical Risk Minimization: the Lessons of Deep Learning

Speaker

The Ohio State University
Abstract: "A model with zero training error is overfit to the training data and will typically generalize poorly" goes statistical textbook wisdom. Yet, in modern practice, over-parametrized deep networks with near perfect fit on training data still show excellent test performance. This apparent contradiction points to troubling cracks in the conceptual foundations of machine learning. While classical analyses of Empirical Risk Minimization rely on balancing the complexity of predictors with training error, modern models are best described by interpolation. In that paradigm a predictor is chosen by minimizing (explicitly or implicitly) a norm corresponding to a certain inductive bias over a space of functions that fit the training data exactly. I will discuss the nature of the challenge to our understanding of machine learning and point the way forward to first analyses that account for the empirically observed phenomena. Furthermore, I will show how classical and modern models can be unified within a single "double descent" risk curve, which subsumes the classical U-shaped bias-variance trade-off.

Finally, as an example of a particularly interesting inductive bias, I will show evidence that deep over-parametrized autoencoder networks, trained with SGD, implement a form of associative memory, with the training examples acting as attractor states.
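
A rough sketch of this kind of experiment (an illustrative assumption on my part, not the speaker's code): train a small over-parametrized MLP autoencoder with SGD until it essentially interpolates a handful of training points, then iterate the learned map from a corrupted input and check whether the iterates settle near one of the stored examples. The architecture, learning rate, and step counts below are arbitrary; whether the iterates actually collapse onto a training example depends on the run.

```python
# Sketch of the associative-memory check for an over-parametrized
# autoencoder trained with SGD (illustrative only).
import torch
import torch.nn as nn

torch.manual_seed(0)

d, n_examples = 20, 5
X = torch.randn(n_examples, d)            # a few "memories" to store

# Over-parametrized autoencoder: far more parameters than training points.
model = nn.Sequential(
    nn.Linear(d, 512), nn.ReLU(),
    nn.Linear(512, 512), nn.ReLU(),
    nn.Linear(512, d),
)
opt = torch.optim.SGD(model.parameters(), lr=0.05)
loss_fn = nn.MSELoss()

# Fit the training examples essentially exactly (interpolation).
for step in range(20000):
    opt.zero_grad()
    loss = loss_fn(model(X), X)
    loss.backward()
    opt.step()
print(f"final training loss: {loss.item():.2e}")

# Iterate the trained map from a noisy version of one training example
# and see which stored example (if any) the fixed point is closest to.
with torch.no_grad():
    x = X[0] + 0.5 * torch.randn(d)
    for t in range(100):
        x = model(x)
    dists = torch.norm(X - x, dim=1)
    print("distance of the iterate to each training example:", dists)
```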