Batch Normalization Causes Gradient Explosion in Deep Randomly Initialized Networks

Speaker

Greg Yang

Microsoft Research

Host

Govind Ramnarayan, Quanquan Liu, Sitan Chen

MIT CSAIL

Abstract: Batch Normalization (batchnorm) has become a staple of deep learning since its introduction in 2015. Its authors conjectured that “Batch Normalization may lead the layer Jacobians to have singular values close to 1”, and recent work suggests that it benefits optimization by smoothing the optimization landscape during training. We disprove the “Jacobian singular value” conjecture for randomly initialized networks, showing that batchnorm causes gradient explosion that is exponential in depth. This implies that, at initialization, batchnorm in fact “roughens” the optimization landscape. Empirically, this explosion prevents one from training ReLU networks with more than 50 layers without skip connections. We discuss several ways of mitigating this explosion and their relevance in practice.

Date

May 01 2019, 16:00 to 17:00 (America/New_York)

Location

32-G575

Organizer & Contact

Rebecca Yadegar

ryadegar@csail.mit.edu

Part of

Algorithms & Complexity Seminars 2018-2019

Related Events

November 09

Simple and Efficient Algorithm for Parallel Matchings

June 05

The Sample Complexity of Toeplitz Covariance Estimation