Ada-Boundary: accelerating DNN training via adaptive boundary batch selection

Hwanjun Song1 · Sundong Kim2 · Minseok Kim1 · Jae-Gil Lee1

Received: 3 December 2019 / Revised: 21 June 2020 / Accepted: 11 August 2020
© The Author(s), under exclusive licence to Springer Science+Business Media LLC, part of Springer Nature 2020
Abstract
Neural networks converge faster with the help of a smart batch selection strategy. In this regard, we propose Ada-Boundary, a novel and simple adaptive batch selection algorithm that constructs an effective mini-batch according to the learning progress of the model. Our key idea is to exploit confusing samples for which the model cannot predict labels with high confidence; samples near the current decision boundary are considered the most effective for expediting convergence. Owing to this design, Ada-Boundary maintains its advantage across various degrees of training difficulty. We demonstrate the advantage of Ada-Boundary through extensive experiments using CNNs on five benchmark data sets. For a fixed wall-clock training time, Ada-Boundary reduces the test error by up to 31.80% relative to the baseline, thereby achieving a faster convergence speed.

Keywords Batch selection · Acceleration · Convergence · Decision boundary
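To make the key idea concrete, the sketch below builds a mini-batch biased toward samples whose softmax probability for the true label barely differs from that of the strongest rival class, i.e., samples close to the decision boundary. This is an illustrative reading of the abstract, not the exact Ada-Boundary procedure; the margin-based score, the softmax weighting, and the function names are assumptions.

import numpy as np

def boundary_scores(softmax_probs, labels):
    """Score samples by closeness to the decision boundary.

    A sample is 'confusing' when the predicted probability of its true
    label is close to that of the strongest competing label, so the
    absolute margin between the two is small.
    """
    n = len(labels)
    true_prob = softmax_probs[np.arange(n), labels]
    rivals = softmax_probs.copy()
    rivals[np.arange(n), labels] = -np.inf      # mask out the true class
    rival_prob = rivals.max(axis=1)
    return -np.abs(true_prob - rival_prob)      # higher score = closer to boundary

def select_boundary_batch(softmax_probs, labels, batch_size, rng=None):
    """Draw a mini-batch biased toward boundary (low-margin) samples."""
    rng = rng or np.random.default_rng()
    scores = boundary_scores(softmax_probs, labels)
    weights = np.exp(scores - scores.max())     # turn scores into selection probabilities
    weights /= weights.sum()
    return rng.choice(len(labels), size=batch_size, replace=False, p=weights)

In an actual training loop, softmax_probs would be refreshed periodically from forward passes over the training set, and the selected indices would feed the next SGD update.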
Editors: Ira Assent, Carlotta Domeniconi, Aristides Gionis, Eyke Hüllermeier.

* Jae-Gil Lee  [email protected]
Hwanjun Song  [email protected]
Sundong Kim  [email protected]
Minseok Kim  [email protected]

1 Graduate School of Knowledge Service Engineering, KAIST, Daejeon, Korea
2 Institute for Basic Science, Daejeon, Korea

1 Introduction

Deep neural networks (DNNs) have achieved remarkable performance in many fields, especially in computer vision and natural language processing (Goodfellow et al. 2016).
[Fig. 1 appears here. Panel (a), "Difficulty distribution", plots sample probability against difficulty (easy, hard, too hard) for an easy case (MNIST) and a hard case (CIFAR-10). Panel (b), "Hard sample oriented training", depicts SGD on a hard batch near the decision boundary.]
Fig. 1 Analysis of the hard batch selection strategy: (a) shows the true sample distribution according to the difficulty computed by Eq. (1) at a training accuracy of 60%. An easy data set (MNIST) has no "too hard" samples but only "moderately hard" samples colored in gray, whereas a relatively hard data set (CIFAR-10) has many "too hard" samples colored in black. (b) shows the result of SGD on a hard batch: the moderately hard samples are informative for updating the model, but the too hard samples make the model overfit to them.
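The distinction the caption draws between moderately hard and too-hard samples can be mimicked with a simple confidence-based proxy, sketched below. The proxy (one minus the true-class softmax probability) and the two thresholds are assumptions for illustration only; the paper's actual difficulty measure is the one defined in its Eq. (1), which is not reproduced here.

import numpy as np

def difficulty_proxy(softmax_probs, labels):
    """Assumed proxy for difficulty: 1 minus the model's confidence in the true label."""
    n = len(labels)
    return 1.0 - softmax_probs[np.arange(n), labels]

def partition_by_difficulty(softmax_probs, labels, easy_thr=0.3, too_hard_thr=0.9):
    """Split sample indices into easy / moderately hard / too hard bins.

    The thresholds are hypothetical; in the spirit of Fig. 1, MNIST would
    populate mostly the first two bins, while CIFAR-10 would also fill the
    'too hard' bin.
    """
    d = difficulty_proxy(softmax_probs, labels)
    easy = np.where(d < easy_thr)[0]
    moderately_hard = np.where((d >= easy_thr) & (d < too_hard_thr))[0]
    too_hard = np.where(d >= too_hard_thr)[0]
    return easy, moderately_hard, too_hard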
Nevertheless, as the size of a data set grows, the training step via stochastic gradient descent (SGD) on mini-batches suffers from extremely high computational cost, which is mainly due to slow convergence. The common approaches for expediting convergence include SGD variants (Zeiler 2012; Kingma and Ba 2015) that maintain individual learning rates for parameters.
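For reference, the sketch below contrasts a plain mini-batch SGD step, which applies one shared learning rate, with an Adam-style step (Kingma and Ba 2015), which keeps running moment estimates so that every parameter receives its own effective step size. The dictionary-based parameter layout and function names are assumptions for illustration.

import numpy as np

def sgd_step(params, grads, lr=0.1):
    """Plain mini-batch SGD: a single learning rate shared by all parameters."""
    return {k: params[k] - lr * grads[k] for k in params}

def init_adam_state(params):
    """Zero-initialize Adam's first/second moment estimates and step counter."""
    return {"t": 0,
            "m": {k: np.zeros_like(v) for k, v in params.items()},
            "v": {k: np.zeros_like(v) for k, v in params.items()}}

def adam_step(params, grads, state, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """Adam-style update: per-parameter effective learning rates via moment estimates."""
    state["t"] += 1
    new_params = {}
    for k in params:
        state["m"][k] = b1 * state["m"][k] + (1 - b1) * grads[k]
        state["v"][k] = b2 * state["v"][k] + (1 - b2) * grads[k] ** 2
        m_hat = state["m"][k] / (1 - b1 ** state["t"])
        v_hat = state["v"][k] / (1 - b2 ** state["t"])
        new_params[k] = params[k] - lr * m_hat / (np.sqrt(v_hat) + eps)
    return new_params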