Gradient descent optimizes over-parameterized deep ReLU networks



Difan Zou¹ · Yuan Cao¹ · Dongruo Zhou¹ · Quanquan Gu¹

Received: 4 May 2019 / Revised: 6 July 2019 / Accepted: 6 September 2019
© The Author(s), under exclusive licence to Springer Science+Business Media LLC, part of Springer Nature 2019

Abstract

We study the problem of training deep fully connected neural networks with the Rectified Linear Unit (ReLU) activation function and the cross-entropy loss function for binary classification using gradient descent. We show that, with proper random weight initialization, gradient descent can find the global minima of the training loss for an over-parameterized deep ReLU network, under certain assumptions on the training data. The key idea of our proof is that Gaussian random initialization followed by gradient descent produces a sequence of iterates that stay inside a small perturbation region centered at the initial weights, in which the training loss function of the deep ReLU network enjoys nice local curvature properties that ensure the global convergence of gradient descent. At the core of our proof technique are (1) a milder assumption on the training data; (2) a sharp analysis of the trajectory length for gradient descent; and (3) a finer characterization of the size of the perturbation region. Compared with the concurrent work along this line (Allen-Zhu et al. in A convergence theory for deep learning via over-parameterization, 2018a; Du et al. in Gradient descent finds global minima of deep neural networks, 2018a), our result relies on a milder over-parameterization condition on the neural network width and enjoys a faster global convergence rate of gradient descent for training deep neural networks.

Keywords Deep neural networks · Gradient descent · Over-parameterization · Random initialization · Global convergence
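To make the setting concrete, the sketch below trains a deep fully connected ReLU network with Gaussian random initialization by full-batch gradient descent on the cross-entropy (logistic) loss for binary classification. It is only an illustration of the setup described in the abstract, not the paper's analysis: the width, depth, step size, and toy data are arbitrary placeholders rather than the values required by the theory.

```python
import numpy as np

# Illustrative sketch: deep fully connected ReLU network, Gaussian random
# initialization, full-batch gradient descent on the cross-entropy loss for
# binary classification. Width m, depth L, step size eta, and the toy data
# are placeholders, not the over-parameterization scale from the theory.
rng = np.random.default_rng(0)
n, d, m, L, eta, steps = 64, 10, 256, 3, 0.1, 200

# Toy data: inputs on the unit sphere, labels in {-1, +1}.
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)
y = np.sign(X[:, 0] + 1e-12)

# Gaussian random initialization (He-style scaling keeps activations O(1)).
Ws = [rng.standard_normal((d, m)) * np.sqrt(2.0 / d)]
Ws += [rng.standard_normal((m, m)) * np.sqrt(2.0 / m) for _ in range(L - 1)]
v = rng.standard_normal(m) * np.sqrt(2.0 / m)   # output layer weights

def forward(X):
    """Return all post-ReLU activations and the network output f(x)."""
    acts = [X]
    for W in Ws:
        acts.append(np.maximum(acts[-1] @ W, 0.0))  # ReLU
    return acts, acts[-1] @ v

for step in range(steps):
    acts, logits = forward(X)
    margins = y * logits
    # Cross-entropy (logistic) loss: mean log(1 + exp(-y * f(x))).
    loss = np.mean(np.logaddexp(0.0, -margins))
    # dloss/dlogit = -y * sigmoid(-y * f(x)), written via tanh for stability.
    g = -y * 0.5 * (1.0 - np.tanh(0.5 * margins)) / n
    # Backpropagation through the ReLU layers.
    grad_v = acts[-1].T @ g
    delta = np.outer(g, v)                      # gradient w.r.t. last activation
    grads = []
    for l in range(L - 1, -1, -1):
        delta = delta * (acts[l + 1] > 0)       # ReLU derivative
        grads.append(acts[l].T @ delta)
        delta = delta @ Ws[l].T
    grads.reverse()
    # Full-batch gradient descent update.
    for l in range(L):
        Ws[l] -= eta * grads[l]
    v -= eta * grad_v
    if step % 50 == 0:
        print(f"step {step:4d}  loss {loss:.4f}")
```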

D. Zou, Y. Cao: Equal contribution. Editors: Kee-Eung Kim and Jun Zhu.


Quanquan Gu [email protected]
Difan Zou [email protected]
Yuan Cao [email protected]
Dongruo Zhou [email protected]

¹ Department of Computer Science, University of California, Los Angeles, CA 90095, USA



1 Introduction

Deep neural networks have achieved great success in many applications such as image processing (Krizhevsky et al. 2012), speech recognition (Hinton et al. 2012) and the game of Go (Silver et al. 2016). However, the reason why deep networks work so well in these fields has long remained a mystery. Different lines of research attempt to understand the mechanism of deep neural networks from different perspectives. For example, a series of works tries to understand how the expressive power of deep neural networks is related to their architecture, including the width of each layer and the depth of the network (Telgarsky 2015, 2016; Lu et al. 2017; Liang and Srikant 2016; Yarotsky 2017, 2018; Hanin 2017; Hanin and Sellke 2017). These works show that multi-layer networks with wide layers can approximate arbitrary continuous functions. Very recently, there has emerged a large body of work that studies the global convergence of gradient descent for training over-parameterized neural networks.