LipschitzLR: Using theoretically computed adaptive learning rates for fast convergence

Rahul Yedida 1 · Snehanshu Saha 2 · Tejas Prashanth 3

1 Department of Computer Science, North Carolina State University, Raleigh, NC, USA
2 Department of CS, IS and APPCAIR, BITS Pilani K K Birla Goa Campus, Bengaluru, India
3 Department of Computer Science, PES University, Bengaluru, India

© Springer Science+Business Media, LLC, part of Springer Nature 2020

Abstract

We present a novel theoretical framework for computing large, adaptive learning rates. Our framework makes minimal assumptions on the activations used and exploits the functional properties of the loss function. Specifically, we show that the inverse of the Lipschitz constant of the loss function is an ideal learning rate. We analytically compute formulas for the Lipschitz constant of several loss functions, and through extensive experimentation, demonstrate the strength of our approach using several architectures and datasets. In addition, we detail the computation of learning rates when other optimizers, namely, SGD with momentum, RMSprop, and Adam, are used. Compared to standard choices of learning rates, our approach converges faster and yields better results.

Keywords Lipschitz constant · Adaptive learning · Machine learning · Deep learning
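In generic notation (a rough sketch of the abstract's central claim only; the precise statements and per-loss formulas are derived later in the paper), a loss function f is K-Lipschitz if

|f(u) − f(v)| ≤ K · ‖u − v‖   for all u, v,

and the learning rate proposed in the abstract is the inverse of this constant, α = 1/K.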

1 Introduction

Gradient descent [6] is a popular optimization algorithm for finding the optima of functions, and in machine learning it is used to minimize loss functions. In an iterative process, it updates randomly initialized weights to reduce the training error. Each update is a small step proportional to the gradient of the loss function; the constant of proportionality is called the learning rate, and it is usually chosen manually. When optimizing a function f with respect to a parameter w, the gradient descent update rule is given by

w := w − α · ∇w f    (1)
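As a concrete illustration of update rule (1) only, the following is a minimal NumPy sketch of gradient descent on a toy least-squares problem; the synthetic data, the fixed learning rate alpha, and the function names are illustrative assumptions, not part of the paper's method.

import numpy as np

def grad(w, X, y):
    # Gradient of the mean squared error loss 0.5 * ||X w - y||^2 / n with respect to w.
    n = X.shape[0]
    return X.T @ (X @ w - y) / n

def gradient_descent(X, y, alpha=0.1, steps=100):
    # Repeated application of update rule (1): w := w - alpha * grad_w f.
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        w = w - alpha * grad(w, X, y)
    return w

# Toy usage on synthetic data (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.01 * rng.normal(size=100)
w_hat = gradient_descent(X, y)

Note that alpha is fixed by hand in this sketch; the paper's contribution is precisely a principled, adaptive choice of this quantity.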

The generalization ability of stochastic gradient descent (SGD) and various methods for faster optimization have quickly gained interest in the machine learning and deep learning communities. Several directions have been taken to understand these phenomena. The stability of SGD is one such direction [14, 22]. Others have proven that gradient descent can find the global minima of the loss functions in over-parameterized deep neural networks [7, 49]. More practical approaches in this regard have involved novel changes to the optimization procedure itself. These include adding a "momentum" term to the update rule [39], and "adaptive gradient" methods such as RMSProp [42] and Adam [20]. These methods have seen widespread use in deep neural networks [1, 27, 45]. Other methods rely on an approximation of the Hessian. These include the Broyden-Fletcher-Goldfarb-Shanno (BFGS) [5, 9, 12, 32] and L-BFGS [23] algorithms. However, our proposed method does not require any modification of th