Accelerating variance-reduced stochastic gradient methods



Derek Driggs¹ · Matthias J. Ehrhardt² · Carola-Bibiane Schönlieb¹

Received: 8 October 2019 / Accepted: 7 September 2020 © The Author(s) 2020

Abstract
Variance reduction is a crucial tool for improving the slow convergence of stochastic gradient descent. Only a few variance-reduced methods, however, have yet been shown to directly benefit from Nesterov's acceleration techniques to match the convergence rates of accelerated gradient methods. Such approaches rely on "negative momentum", a technique for further variance reduction that is generally specific to the SVRG gradient estimator. In this work, we show for the first time that negative momentum is unnecessary for acceleration and develop a universal acceleration framework that allows all popular variance-reduced methods to achieve accelerated convergence rates. The constants appearing in these rates, including their dependence on the number of functions n, scale with the mean-squared-error and bias of the gradient estimator. In a series of numerical experiments, we demonstrate that versions of SAGA, SVRG, SARAH, and SARGE using our framework significantly outperform non-accelerated versions and compare favourably with algorithms using negative momentum.

Keywords Stochastic optimisation · Convex optimisation · Variance reduction · Accelerated gradient descent

Mathematics Subject Classification 90C06 · 90C15 · 90C25 · 90C30 · 90C60 · 68Q25
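As a point of reference for the discussion above, the following is a minimal Python sketch of the classical SVRG gradient estimator (the estimator to which negative momentum is usually tied); the names `grad_i`, `x_snapshot`, and `full_grad_snapshot` are illustrative placeholders rather than notation from the paper.

```python
def svrg_estimator(grad_i, x, x_snapshot, full_grad_snapshot, i):
    """Classical SVRG variance-reduced gradient estimator:
        grad f_i(x) - grad f_i(x_tilde) + grad f(x_tilde),
    where x_tilde is a periodically refreshed snapshot point.
    In expectation over a uniformly sampled index i this equals the full
    gradient at x (the estimator is unbiased), and its variance shrinks
    as x and x_tilde approach a minimiser.
    """
    return grad_i(x, i) - grad_i(x_snapshot, i) + full_grad_snapshot
```

Other estimators treated in this work (SAGA, SARAH, SARGE) replace the snapshot-based correction with stored or recursively updated gradient information.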

C.-B.S. and M.J.E. acknowledge support from the EPSRC Grant No. EP/S026045/1. C.-B.S. further acknowledges support from the Leverhulme Trust project on ‘Breaking the non-convexity barrier’, the Philip Leverhulme Prize, the EPSRC grant EP/T003553/1, the EPSRC Centre Nr. EP/N014588/1, the Wellcome Innovator Award RG98755, European Union Horizon 2020 research and innovation programmes under the Marie Skłodowska-Curie Grant Agreement No. 777826 NoMADS and No. 691070 CHiPS, the Cantab Capital Institute for the Mathematics of Information and the Alan Turing Institute. D.D. acknowledges support from the Gates Cambridge Trust and the Cantab Capital Institute for the Mathematics of Information.


Derek Driggs [email protected]

Extended author information available on the last page of the article


1 Introduction

We are interested in solving the following composite convex minimisation problem:

$$
\min_{x \in \mathbb{R}^m} \left\{ F(x) \overset{\mathrm{def}}{=} f(x) + g(x) \overset{\mathrm{def}}{=} \frac{1}{n} \sum_{i=1}^{n} f_i(x) + g(x) \right\}. \tag{1}
$$

Throughout, we assume each $f_i : \mathbb{R}^m \to \mathbb{R}$ is convex and has an $L$-Lipschitz continuous gradient. We also assume $g : \mathbb{R}^m \to \mathbb{R} \cup \{\infty\}$ is proper, lower semicontinuous, and $\mu$-strongly convex with $\mu \ge 0$, but we do not require $g$ to be differentiable. Problems of this form are ubiquitous in many fields, including machine learning, compressed sensing, and image processing (see, e.g., [11,12,25,35]). Fundamental examples include LASSO [35] and matrix completion [12], where $f$ is a least-squares loss and $g$ is the $\ell_1$ or nuclear norm, respectively, and sparse logistic regression, where $f$ is a logistic loss and $g$ is the $\ell_1$ norm.
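As a concrete instance of (1), the sketch below writes out the LASSO example in Python, with each $f_i$ a single least-squares term and $g$ the $\ell_1$ norm handled through its proximal operator; the names `A`, `b`, and `lam` are illustrative and not part of the paper's notation.

```python
import numpy as np

def f_i(x, A, b, i):
    """One least-squares component of the smooth part: f_i(x) = 0.5 * (a_i^T x - b_i)^2."""
    r = A[i] @ x - b[i]
    return 0.5 * r ** 2

def grad_f_i(x, A, b, i):
    """Gradient of f_i; it is Lipschitz continuous with constant ||a_i||^2."""
    return (A[i] @ x - b[i]) * A[i]

def prox_g(x, step, lam):
    """Proximal operator of g(x) = lam * ||x||_1 (soft-thresholding)."""
    return np.sign(x) * np.maximum(np.abs(x) - step * lam, 0.0)

def F(x, A, b, lam):
    """Composite objective F(x) = (1/n) * sum_i f_i(x) + g(x)."""
    n = A.shape[0]
    return sum(f_i(x, A, b, i) for i in range(n)) / n + lam * np.abs(x).sum()
```

A proximal stochastic gradient method samples an index i, forms a (possibly variance-reduced) estimate of ∇f from grad_f_i, takes a gradient step, and applies prox_g; matrix completion follows the same pattern with the nuclear norm and its proximal operator (singular-value thresholding) in place of prox_g.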