Gradient descent optimizes over-parameterized deep ReLU networks



Difan Zou¹ · Yuan Cao¹ · Dongruo Zhou¹ · Quanquan Gu¹

Received: 4 May 2019 / Revised: 6 July 2019 / Accepted: 6 September 2019
© The Author(s), under exclusive licence to Springer Science+Business Media LLC, part of Springer Nature 2019

Abstract

We study the problem of training deep fully connected neural networks with the Rectified Linear Unit (ReLU) activation function and the cross-entropy loss function for binary classification using gradient descent. We show that, with proper random weight initialization, gradient descent can find the global minima of the training loss for an over-parameterized deep ReLU network, under certain assumptions on the training data. The key idea of our proof is that Gaussian random initialization followed by gradient descent produces a sequence of iterates that stay inside a small perturbation region centered at the initial weights, in which the training loss function of the deep ReLU network enjoys nice local curvature properties that ensure the global convergence of gradient descent. At the core of our proof technique are (1) a milder assumption on the training data; (2) a sharp analysis of the trajectory length for gradient descent; and (3) a finer characterization of the size of the perturbation region. Compared with the concurrent work along this line (Allen-Zhu et al. in A convergence theory for deep learning via over-parameterization, 2018a; Du et al. in Gradient descent finds global minima of deep neural networks, 2018a), our result relies on a milder over-parameterization condition on the neural network width and enjoys a faster global convergence rate of gradient descent for training deep neural networks.

Keywords Deep neural networks · Gradient descent · Over-parameterization · Random initialization · Global convergence
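To make the setting concrete, the sketch below trains a deep fully connected ReLU network with Gaussian random initialization by full-batch gradient descent on the cross-entropy (logistic) loss for binary classification. It is only an illustration of the setup described in the abstract, not the paper's analysis: the width, depth, step size, and toy data are arbitrary placeholders rather than the values required by the theory.

```python
import numpy as np

# Illustrative sketch: deep fully connected ReLU network, Gaussian random
# initialization, full-batch gradient descent on the cross-entropy loss for
# binary classification. Width m, depth L, step size eta, and the toy data
# are placeholders, not the over-parameterization scale from the theory.
rng = np.random.default_rng(0)
n, d, m, L, eta, steps = 64, 10, 256, 3, 0.1, 200

# Toy data: inputs on the unit sphere, labels in {-1, +1}.
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)
y = np.sign(X[:, 0] + 1e-12)

# Gaussian random initialization (He-style scaling keeps activations O(1)).
Ws = [rng.standard_normal((d, m)) * np.sqrt(2.0 / d)]
Ws += [rng.standard_normal((m, m)) * np.sqrt(2.0 / m) for _ in range(L - 1)]
v = rng.standard_normal(m) * np.sqrt(2.0 / m)   # output layer weights

def forward(X):
    """Return all post-ReLU activations and the network output f(x)."""
    acts = [X]
    for W in Ws:
        acts.append(np.maximum(acts[-1] @ W, 0.0))  # ReLU
    return acts, acts[-1] @ v

for step in range(steps):
    acts, logits = forward(X)
    margins = y * logits
    # Cross-entropy (logistic) loss: mean log(1 + exp(-y * f(x))).
    loss = np.mean(np.logaddexp(0.0, -margins))
    # dloss/dlogit = -y * sigmoid(-y * f(x)), written via tanh for stability.
    g = -y * 0.5 * (1.0 - np.tanh(0.5 * margins)) / n
    # Backpropagation through the ReLU layers.
    grad_v = acts[-1].T @ g
    delta = np.outer(g, v)                      # gradient w.r.t. last activation
    grads = []
    for l in range(L - 1, -1, -1):
        delta = delta * (acts[l + 1] > 0)       # ReLU derivative
        grads.append(acts[l].T @ delta)
        delta = delta @ Ws[l].T
    grads.reverse()
    # Full-batch gradient descent update.
    for l in range(L):
        Ws[l] -= eta * grads[l]
    v -= eta * grad_v
    if step % 50 == 0:
        print(f"step {step:4d}  loss {loss:.4f}")
```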

D. Zou, Y. Cao: Equal contribution. Editors: Kee-Eung Kim and Jun Zhu.


Quanquan Gu [email protected]
Difan Zou [email protected]
Yuan Cao [email protected]
Dongruo Zhou [email protected]

¹ Department of Computer Science, University of California, Los Angeles, CA 90095, USA



1 Introduction

Deep neural networks have achieved great success in many applications such as image processing (Krizhevsky et al. 2012), speech recognition (Hinton et al. 2012) and the game of Go (Silver et al. 2016). However, the reason why deep networks work so well in these fields has long remained a mystery. Different lines of research attempt to understand the mechanism of deep neural networks from different perspectives. For example, a series of works tries to understand how the expressive power of deep neural networks is related to their architecture, including the width of each layer and the depth of the network (Telgarsky 2015, 2016; Lu et al. 2017; Liang and Srikant 2016; Yarotsky 2017, 2018; Hanin 2017; Hanin and Sellke 2017). These works show that multi-layer networks with wide layers can approximate arbitrary continuous functions. Very recently, there has emerged a large body of work that studies the global convergence of gradient descent for training over-parameterized neural networks.