Convergence of Stochastic Proximal Gradient Algorithm


Lorenzo Rosasco1,2 · Silvia Villa3 · Bằng Công Vũ4

© Springer Science+Business Media, LLC, part of Springer Nature 2019

Abstract We study the extension of the proximal gradient algorithm where only a stochastic gradient estimate is available and a relaxation step is allowed. We establish convergence rates for function values in the convex case, as well as almost sure convergence and convergence rates for the iterates under further convexity assumptions. Our analysis avoids averaging of the iterates and error summability assumptions, which might not be satisfied in applications, e.g., in machine learning. Our proof technique extends classical ideas from the analysis of deterministic proximal gradient algorithms.

Keywords Proximal methods · Forward–backward splitting algorithm · Stochastic optimization

1 Introduction First order methods are widely applied to solve optimization problems in a variety of areas including machine learning [2,32] and signal processing [15]. In particular, proximal gradient algorithms, also known as forward–backward splitting algorithms [4], are natural in the common situation where the function to be minimized is the sum of a data fitting term and a regularizer. These approaches separate the contribution of each component: at each iteration the proximal operator defined by the

Corresponding author: Silvia Villa, [email protected]
Lorenzo Rosasco, [email protected]
Bằng Công Vũ, [email protected]

1 DIBRIS, Università di Genova, Via Dodecaneso 35, 16146 Genoa, Italy
2 LCSL, Istituto Italiano di Tecnologia and Massachusetts Institute of Technology, Bldg. 46-5155, 77 Massachusetts Avenue, Cambridge, MA 02139, USA
3 Dipartimento di Matematica, Università di Genova, Via Dodecaneso 35, 16146 Genoa, Italy
4 EPFL STI IEL LIONS, ELD 243 (Bâtiment EL), Station 11, 1015 Lausanne, Switzerland


Applied Mathematics & Optimization

nonsmooth term is applied to the gradient descent step for the smooth term. In practice, it is relevant to consider situations where these operations cannot be performed exactly. For example, the case where the proximal operator is known only up to an error has been considered; see [43,50] and references therein. Another example, of interest for this paper, is when only stochastic estimates of the gradients are available. This setting is relevant in large scale statistical learning [10] and more generally in stochastic approximation [27]. The study of stochastic gradient methods goes back to [26,39], and this algorithm has been studied extensively for smooth, strongly convex objective functions [5,19,20,27,34,36]. In this context, a classical idea is to average the iterates to allow for longer step-sizes [35,37]. Stochastic gradient methods have also been extensively studied in the context of online learning [11]. Here, the analysis of the so-called regret allows one to derive convergence rates for the average iterates through a proof technique known as online-to-batch conversion; see, e.g., [44] and references therein. F
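As an illustration of the kind of iteration discussed above, the following sketch applies a stochastic forward–backward step with relaxation to an ℓ1-regularized least-squares problem. This is not the paper's algorithm verbatim: the problem instance, step-size choice, and all function names and parameters are illustrative assumptions.

```python
import numpy as np

def prox_l1(v, t):
    """Proximal operator of t * ||.||_1 (soft-thresholding), the
    nonsmooth step of forward-backward splitting for an l1 regularizer."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def stochastic_proximal_gradient(A, b, reg, n_iter=2000, step=0.01,
                                 relax=1.0, seed=0):
    """Minimize (1/2n)||Ax - b||^2 + reg * ||x||_1 using, at each
    iteration, a stochastic gradient estimate from one random data point,
    followed by the proximal step and a relaxation step."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    x = np.zeros(d)
    for _ in range(n_iter):
        i = rng.integers(n)                      # sample one data point
        g = (A[i] @ x - b[i]) * A[i]             # stochastic gradient of the smooth term
        y = prox_l1(x - step * g, step * reg)    # forward (gradient) then backward (prox) step
        x = x + relax * (y - x)                  # relaxation step (relax = 1 recovers plain iteration)
    return x
```

With `relax = 1` this reduces to the plain stochastic proximal gradient iteration; values in (0, 2) correspond to the relaxed variant the abstract refers to.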