LipschitzLR: Using theoretically computed adaptive learning rates for fast convergence

Rahul Yedida 1 · Snehanshu Saha 2 · Tejas Prashanth 3

1 Department of Computer Science, North Carolina State University, Raleigh, NC, USA
2 Department of CS, IS and APPCAIR, BITS Pilani K K Birla Goa Campus, Bengaluru, India
3 Department of Computer Science, PES University, Bengaluru, India

© Springer Science+Business Media, LLC, part of Springer Nature 2020

Abstract

We present a novel theoretical framework for computing large, adaptive learning rates. Our framework makes minimal assumptions on the activations used and exploits the functional properties of the loss function. Specifically, we show that the inverse of the Lipschitz constant of the loss function is an ideal learning rate. We analytically compute formulas for the Lipschitz constant of several loss functions, and through extensive experimentation, demonstrate the strength of our approach using several architectures and datasets. In addition, we detail the computation of learning rates when other optimizers, namely, SGD with momentum, RMSprop, and Adam, are used. Compared to standard choices of learning rates, our approach converges faster and yields better results.

Keywords Lipschitz constant · Adaptive learning · Machine learning · Deep learning
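In generic notation (a rough sketch of the abstract's central claim only; the precise statements and per-loss formulas are derived later in the paper), a loss function f is K-Lipschitz if

|f(u) − f(v)| ≤ K · ‖u − v‖   for all u, v,

and the learning rate proposed in the abstract is the inverse of this constant, α = 1/K.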

1 Introduction

Gradient descent [6] is a popular optimization algorithm for finding the optima of functions, and in machine learning it is used to minimize loss functions. In an iterative process, it updates randomly initialized weights to reduce the training error. Each update is a small step proportional to the gradient of the loss function; the constant of proportionality is called the learning rate, and it is usually chosen manually. When optimizing a function f with respect to a parameter w, the gradient descent update rule is given by

w := w − α · ∇w f    (1)
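As a concrete illustration of update rule (1) only, the following is a minimal NumPy sketch of gradient descent on a toy least-squares problem; the synthetic data, the fixed learning rate alpha, and the function names are illustrative assumptions, not part of the paper's method.

import numpy as np

def grad(w, X, y):
    # Gradient of the mean squared error loss 0.5 * ||X w - y||^2 / n with respect to w.
    n = X.shape[0]
    return X.T @ (X @ w - y) / n

def gradient_descent(X, y, alpha=0.1, steps=100):
    # Repeated application of update rule (1): w := w - alpha * grad_w f.
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        w = w - alpha * grad(w, X, y)
    return w

# Toy usage on synthetic data (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.01 * rng.normal(size=100)
w_hat = gradient_descent(X, y)

Note that alpha is fixed by hand in this sketch; the paper's contribution is precisely a principled, adaptive choice of this quantity.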

The generalization ability of stochastic gradient descent (SGD) and various methods for faster optimization have quickly gained interest in the machine learning and deep learning communities. Several directions have been taken to understand these phenomena. The stability of SGD is one such direction [14, 22]. Others have proven that gradient descent can find the global minima of the loss functions in over-parameterized deep neural networks [7, 49]. More practical approaches in this regard have involved novel changes to the optimization procedure itself. These include adding a "momentum" term to the update rule [39], and "adaptive gradient" methods such as RMSProp [42] and Adam [20]. These methods have seen widespread use in deep neural networks [1, 27, 45]. Other methods rely on an approximation of the Hessian. These include the Broyden-Fletcher-Goldfarb-Shanno (BFGS) [5, 9, 12, 32] and L-BFGS [23] algorithms. However, our proposed method does not require any modification of th