Backtracking Gradient Descent Method and Some Applications in Large Scale Optimisation. Part 2: Algorithms and Experiments

Tuyen Trung Truong1 · Hang-Tuan Nguyen2

© The Author(s) 2020

Abstract
In this paper, we provide new results and algorithms (including backtracking versions of Nesterov accelerated gradient and Momentum) which are more applicable to large scale optimisation, as in Deep Neural Networks. We also demonstrate that Backtracking Gradient Descent (Backtracking GD) can obtain good upper bound estimates for local Lipschitz constants of the gradient, and that the convergence rate of Backtracking GD is similar to that in the classical work of Armijo. Experiments with the datasets CIFAR10 and CIFAR100 on various popular architectures verify a heuristic argument that, in the mini-batch practice, Backtracking GD stabilises to a finite union of sequences constructed from Standard GD, and show that our new algorithms (while automatically fine-tuning learning rates) perform better than current state-of-the-art methods such as Adam, Adagrad, Adadelta, RMSProp, Momentum and Nesterov accelerated gradient. To help readers avoid confusion between heuristics and more rigorously justified algorithms, we also provide a review of the current state of convergence results for gradient descent methods. Accompanying source code is available on GitHub.

Keywords Automation of learning rates · Backtracking · Deep neural networks · Random dynamical systems · Global convergence · Gradient descent · Image classification · Iterative optimisation · Large scale optimisation · Local minimum

Mathematics Subject Classification 65Kxx · 68Txx · 49Mxx · 68Uxx
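For orientation, the following is a minimal sketch of the classical Armijo-type backtracking rule that Backtracking GD is based on; it is not the authors' released implementation, and the function backtracking_gd, the parameters delta0, alpha, beta, and the quadratic test objective are illustrative choices made for this example.

import numpy as np

def backtracking_gd(f, grad_f, x0, delta0=1.0, alpha=0.5, beta=0.7,
                    tol=1e-8, max_iter=1000):
    """Gradient descent with an Armijo-type backtracking line search (sketch)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad_f(x)
        if np.linalg.norm(g) < tol:
            break
        delta = delta0
        # Shrink the learning rate until the sufficient-decrease condition holds:
        # f(x - delta * g) <= f(x) - alpha * delta * ||g||^2
        while f(x - delta * g) > f(x) - alpha * delta * np.dot(g, g):
            delta *= beta
        x = x - delta * g
    return x

# Usage on a simple ill-conditioned quadratic (hypothetical test problem).
A = np.diag([1.0, 100.0])
f = lambda x: 0.5 * x @ A @ x
grad_f = lambda x: A @ x
print(backtracking_gd(f, grad_f, x0=[3.0, -2.0]))

The point of the backtracking step is that the learning rate delta is chosen automatically at each iteration, rather than fixed in advance as in Standard GD.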
Tuyen Trung Truong, [email protected]
Hang-Tuan Nguyen, [email protected]

1 Matematikk Institut, Universitetet i Oslo, Blindern 0851, Oslo, Norway
2 Axon AI Research, Seattle, WA, USA
1 Introduction

In this section, we provide a non-technical overview of the important role and current practice of Gradient Descent methods (GD) in optimisation, in particular in large scale optimisation as in Deep Neural Networks (DNN), and of some new features of the main results in this paper.

One special feature of modern society is the need to solve large scale optimisation problems quickly, stably, efficiently and reproducibly. One exemplar is the development of Deep Learning, which has achieved spectacular results in recent years. Among the most famous novelties are AlphaGo (the first computer program ever to defeat human players, and in fact the best human players, at the game of Go) and new developments in self-driving cars. An important tool in Deep Learning is the DNN. Roughly speaking, a DNN is a parametrised family f(x, θ), built as a composition of very simple maps, aiming to approximate a feature y well. By choosing an explicit metric to measure the difference between the prediction f(x, θ) of the DNN and the "ground truth" y over a training set I, one obtains a specific value of θ as a global minimum of an associated optimisation problem.
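In standard notation (not taken verbatim from this paper), if L measures the discrepancy between a prediction and the ground truth, the associated training objective and the gradient descent update can be written as
\[
  F(\theta) \;=\; \frac{1}{|I|}\sum_{(x,y)\in I} L\bigl(f(x,\theta),\, y\bigr),
  \qquad
  \theta_{k+1} \;=\; \theta_k - \delta_k\, \nabla F(\theta_k),
\]
where the choice of the learning rates \(\delta_k\) is precisely what the backtracking procedure studied in this paper automates.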