Accelerated and Unaccelerated Stochastic Gradient Descent in Model Generality

D. M. Dvinskikh^{1,2,3,*}, A. I. Tyurin^{4,**}, A. V. Gasnikov^{2,3,4,***}, and C. C. Omel'chenko^{2,****}

^1 Weierstrass Institute, Berlin, 10117 Germany
^2 Moscow Institute of Physics and Technology (State University), Dolgoprudnyi, Moscow Oblast, 141701 Russia
^3 Kharkevich Institute for Information Transmission Problems, Russian Academy of Sciences, Moscow, 101447 Russia
^4 National Research University Higher School of Economics, Moscow, 101000 Russia

Received April 11, 2020; in final form, May 20, 2020; accepted June 16, 2020

Abstract—A new method for deriving convergence-rate estimates of optimal methods for smooth (strongly) convex stochastic optimization problems is described. The method builds on convergence results for optimal methods run with inexact gradients corrupted by small noise of nonrandom nature. In contrast to earlier results, all estimates in the present paper are obtained in model generality.

DOI: 10.1134/S0001434620090230

Keywords: stochastic optimization, accelerated gradient descent, model generality, composite optimization.

1. INTRODUCTION

In the present paper, we consider the following stochastic optimization problem [1]–[3]:

    f(x) = \mathbb{E}[f(x, \xi)] \to \min_{x \in Q \subseteq \mathbb{R}^n},    (1.1)

where the set Q is assumed to be convex and closed, ξ is a random variable, the expectation \mathbb{E}[f(x, \xi)] is defined and finite for any x ∈ Q, and the function f(x) is μ-strongly convex in the 2-norm (μ ≥ 0) and has an L-Lipschitz gradient, i.e., for all x, y ∈ Q,

    f(x) + \langle \nabla f(x), y - x \rangle + \frac{\mu}{2} \|y - x\|_2^2 \le f(y) \le f(x) + \langle \nabla f(x), y - x \rangle + \frac{L}{2} \|y - x\|_2^2.

We assume that \nabla f(x, \xi) is an available stochastic gradient of f(x) satisfying the following conditions^1 (the stochastic gradient is unbiased and the tails of its distribution are sub-Gaussian with sub-Gaussian variance σ^2):

    \mathbb{E}[\nabla f(x, \xi)] \equiv \nabla f(x),    \mathbb{E}\left[\exp\left(\frac{\|\nabla f(x, \xi) - \mathbb{E}[\nabla f(x, \xi)]\|_2^2}{\sigma^2}\right)\right] \le \exp(1)    (1.2)

for all x ∈ Q.
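For concreteness, the two-sided inequality above can be checked numerically on a quadratic; the following is a minimal sketch (the quadratic objective, the random matrix A, and the tolerance are illustrative assumptions, not part of the paper):

```python
import numpy as np

# For a quadratic f(x) = 0.5 * x^T A x with A symmetric positive definite,
# f is mu-strongly convex and has an L-Lipschitz gradient with
# mu = lambda_min(A) and L = lambda_max(A).  We verify the two-sided
# inequality from the text at random pairs (x, y).
rng = np.random.default_rng(0)
n = 5
M = rng.standard_normal((n, n))
A = M @ M.T + np.eye(n)           # symmetric positive definite
mu = np.linalg.eigvalsh(A).min()  # strong-convexity parameter
L = np.linalg.eigvalsh(A).max()   # Lipschitz constant of the gradient

f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x

for _ in range(1000):
    x, y = rng.standard_normal(n), rng.standard_normal(n)
    lin = f(x) + grad(x) @ (y - x)  # first-order approximation at x
    d2 = np.dot(y - x, y - x)       # ||y - x||_2^2
    lower = lin + 0.5 * mu * d2     # strong-convexity lower bound
    upper = lin + 0.5 * L * d2      # smoothness upper bound
    assert lower - 1e-9 <= f(y) <= upper + 1e-9
```

For a quadratic both bounds follow from μ‖d‖² ≤ dᵀAd ≤ L‖d‖², so the check passes exactly up to floating-point tolerance.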

* E-mail: [email protected]
** E-mail: [email protected]
*** E-mail: [email protected]
**** E-mail: [email protected]
^1 Note that, for minimization problems for functionals of sum type, the boundedness condition on the (sub-Gaussian) variance may be violated even in very simple (quadratic) situations. As a consequence, in the general case, the results given in this paper cannot be extended to minimization problems for functionals of sum type in which the gradient of a randomly chosen summand [4] is used as the stochastic gradient.
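A stochastic gradient oracle satisfying conditions of the form (1.2) is easy to simulate. The sketch below (the specific objective f(x) = ½‖x‖₂², the Gaussian noise model, and the value of σ² are assumptions for illustration) checks unbiasedness and the sub-Gaussian moment bound empirically. For Gaussian noise ξ ~ N(0, s²Iₙ) one has E[exp(‖ξ‖₂²/σ²)] = (1 − 2s²/σ²)^{−n/2}, so σ² = 8 with s = 1, n = 2 gives 4/3 ≤ e.

```python
import numpy as np

# Hypothetical stochastic gradient oracle for f(x) = 0.5 * ||x||_2^2:
# grad f(x, xi) = x + noise with noise ~ N(0, s^2 I_n).  The oracle is
# unbiased, and sigma^2 = 8 makes the sub-Gaussian condition (1.2) hold:
# E[exp(||noise||^2 / sigma^2)] = (1 - 2 s^2 / sigma^2)^(-n/2) = 4/3 <= e.
rng = np.random.default_rng(1)
n, s = 2, 1.0
sigma2 = 8.0  # assumed sub-Gaussian variance parameter

def stoch_grad(x):
    return x + s * rng.standard_normal(n)

x = rng.standard_normal(n)
samples = np.array([stoch_grad(x) for _ in range(200_000)])

# Unbiasedness: the empirical mean is close to the true gradient x.
assert np.allclose(samples.mean(axis=0), x, atol=0.02)

# Sub-Gaussian condition (1.2): empirical E[exp(||noise||^2 / sigma2)] <= e.
noise_sq = ((samples - x) ** 2).sum(axis=1)
assert np.exp(noise_sq / sigma2).mean() <= np.e
```

Choosing σ² large enough relative to the noise scale s is what keeps the exponential moment finite; at σ² = 2s²n the moment diverges, which is why the check uses a comfortable margin.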


Then, after N calculations of \nabla f(x, \xi), with large probability we have^2 [1]–[3]

    f(x_N) - f(x_*) = O\left( \min\left\{ \frac{L R^2}{N^p},\; L R^2 \exp\left( -\left(\frac{\mu}{L}\right)^{1/p} \frac{N}{2} \right) \right\} + \min\left\{ \frac{\sigma R}{\sqrt{N}},\; \frac{\sigma^2}{\mu N} \right\} \right),    (1.3)

where x_* is a solution of problem (1.1), R = \|x_0 - x_*\|_2, x_0 is the starting point, p = 1 corresponds to stochastic gradient descent, and p = 2 to accelerated stochastic gradient descent. On the other hand, it is known (see [1], [2], [5]–[7]) that if, for problem (