Ada-boundary: accelerating DNN training via adaptive boundary batch selection

Hwanjun Song1 · Sundong Kim2 · Minseok Kim1 · Jae‑Gil Lee1

1 Graduate School of Knowledge Service Engineering, KAIST, Daejeon, Korea
2 Institute for Basic Science, Daejeon, Korea

Received: 3 December 2019 / Revised: 21 June 2020 / Accepted: 11 August 2020
© The Author(s), under exclusive licence to Springer Science+Business Media LLC, part of Springer Nature 2020

Editors: Ira Assent, Carlotta Domeniconi, Aristides Gionis, Eyke Hüllermeier.

Abstract
Neural networks converge faster with the help of a smart batch selection strategy. In this regard, we propose Ada-Boundary, a novel and simple adaptive batch selection algorithm that constructs an effective mini-batch according to the learning progress of the model. Our key idea is to exploit confusing samples for which the model cannot predict labels with high confidence; samples near the current decision boundary are considered the most effective for expediting convergence. Owing to this design, Ada-Boundary maintains its advantage across varying degrees of training difficulty. We demonstrate the advantage of Ada-Boundary by extensive experiments using CNNs on five benchmark data sets. Ada-Boundary reduces the test error by up to 31.80% relative to the baseline for a fixed wall-clock training time, thereby achieving a faster convergence speed.

Keywords Batch selection · Acceleration · Convergence · Decision boundary
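To make the key idea concrete, the following is a minimal sketch, not the authors' implementation: it uses the margin between the two largest softmax probabilities as a stand-in for a sample's distance to the decision boundary and draws a mini-batch biased toward low-margin, i.e., confusing, samples. The function names and the weighting scheme are our own assumptions.

```python
import numpy as np

def boundary_scores(softmax_probs):
    """Proxy for closeness to the decision boundary: the margin between the two
    largest class probabilities. A small margin means the model is confused, so
    the sample lies near the boundary. (Illustrative only; the paper defines its
    own distance measure.)"""
    top2 = np.sort(softmax_probs, axis=1)[:, -2:]  # two largest probabilities per sample
    return top2[:, 1] - top2[:, 0]                 # margin in [0, 1]

def select_boundary_batch(softmax_probs, batch_size, rng=None):
    """Draw a mini-batch that favors samples close to the boundary (low margin)
    while keeping some randomness in the selection."""
    rng = rng if rng is not None else np.random.default_rng()
    margin = boundary_scores(softmax_probs)
    weights = (1.0 - margin) + 1e-3                # near-boundary samples get larger weight
    weights = weights / weights.sum()
    return rng.choice(len(margin), size=batch_size, replace=False, p=weights)

# Usage: `probs` stands for the model's current softmax outputs on the training set.
probs = np.random.dirichlet(np.ones(10), size=1000)   # dummy (N=1000, C=10) probabilities
batch_indices = select_boundary_batch(probs, batch_size=128)
```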

1 Introduction

Deep neural networks (DNNs) have achieved remarkable performance in many fields, especially in computer vision and natural language processing (Goodfellow et al. 2016). Nevertheless, as the size of a data set grows, the training step via stochastic gradient descent (SGD) based on mini-batches suffers from an extremely high computational cost, which is mainly due to slow convergence.

[Figure 1: panel (a) "Difficulty distribution" plots probability versus sample difficulty for an easy case (MNIST) and a hard case (CIFAR-10); panel (b) "Hard sample oriented training" illustrates SGD on a hard batch near the decision boundary.]

Fig. 1 Analysis of the hard batch selection strategy: (a) shows the true sample distribution according to the difficulty computed by Eq. (1) at a training accuracy of 60%. An easy data set (MNIST) has no "too hard" samples but only "moderately hard" samples colored in gray, whereas a relatively hard data set (CIFAR-10) has many "too hard" samples colored in black. (b) shows the result of SGD on a hard batch: the moderately hard samples are informative for updating the model, but the too hard samples make the model overfit to them.
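To illustrate the distinction the caption draws between easy, moderately hard, and too hard samples, the sketch below uses one minus the softmax probability of the true label as a difficulty score and buckets samples with arbitrary thresholds; both the score and the thresholds are our own assumptions, not the paper's Eq. (1).

```python
import numpy as np

def difficulty(softmax_probs, labels):
    """Illustrative difficulty score: 1 - P(true label | x).
    A hypothetical stand-in for the paper's Eq. (1)."""
    true_prob = softmax_probs[np.arange(len(labels)), labels]
    return 1.0 - true_prob

def difficulty_histogram(scores, easy_thr=0.3, too_hard_thr=0.9):
    """Count samples in the easy / moderately hard / too hard regions.
    Thresholds are arbitrary and chosen only for illustration."""
    return {
        "easy": int(np.sum(scores < easy_thr)),
        "moderately hard": int(np.sum((scores >= easy_thr) & (scores < too_hard_thr))),
        "too hard": int(np.sum(scores >= too_hard_thr)),
    }
```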

The common approaches for expediting convergence include SGD variants (Zeiler 2012; Kingma and Ba 2015) that maintain individual learning rates for parameters