Accelerated and Unaccelerated Stochastic Gradient Descent in Model Generality

D. M. Dvinskikh^{1,2,3,*}, A. I. Tyurin^{4,**}, A. V. Gasnikov^{2,3,4,***}, and C. C. Omel'chenko^{2,****}

^1 Weierstrass Institute, Berlin, 10117 Germany
^2 Moscow Institute of Physics and Technology (State University), Dolgoprudnyi, Moscow Oblast, 141701 Russia
^3 Kharkevich Institute for Information Transmission Problems, Russian Academy of Sciences, Moscow, 101447 Russia
^4 National Research University Higher School of Economics, Moscow, 101000 Russia

Received April 11, 2020; in final form, May 20, 2020; accepted June 16, 2020

Abstract—A new method for deriving convergence-rate estimates of optimal methods for smooth (strongly) convex stochastic optimization problems is described. The method builds on convergence results for optimal methods run with inexact gradients corrupted by small noise of nonrandom nature. In contrast to earlier results, all estimates in the present paper are obtained in model generality.

DOI: 10.1134/S0001434620090230

Keywords: stochastic optimization, accelerated gradient descent, model generality, composite optimization.

1. INTRODUCTION

In the present paper, we consider the following stochastic optimization problem [1]–[3]:

    f(x) = \mathbb{E}[f(x, \xi)] \to \min_{x \in Q \subseteq \mathbb{R}^n},    (1.1)

where the set Q is assumed to be convex and closed, ξ is a random variable, the expectation \mathbb{E}[f(x, \xi)] is defined and finite for any x ∈ Q, and the function f(x) is μ-strongly convex in the 2-norm (μ ≥ 0) and has an L-Lipschitz gradient, i.e., for all x, y ∈ Q,

    f(x) + \langle \nabla f(x), y - x \rangle + \frac{\mu}{2} \|y - x\|_2^2 \le f(y) \le f(x) + \langle \nabla f(x), y - x \rangle + \frac{L}{2} \|y - x\|_2^2.

We assume that \nabla f(x, \xi) is an available stochastic gradient of f(x) satisfying the following conditions^1 (the stochastic gradient is unbiased and the tails of its distribution are sub-Gaussian with sub-Gaussian variance σ^2):

    \mathbb{E}[\nabla f(x, \xi)] \equiv \nabla f(x),    \mathbb{E}\left[\exp\left(\frac{\|\nabla f(x, \xi) - \mathbb{E}[\nabla f(x, \xi)]\|_2^2}{\sigma^2}\right)\right] \le \exp(1)    (1.2)

for all x ∈ Q.
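For concreteness, the two-sided inequality above can be checked numerically on a quadratic; the following is a minimal sketch (the quadratic objective, the random matrix A, and the tolerance are illustrative assumptions, not part of the paper):

```python
import numpy as np

# For a quadratic f(x) = 0.5 * x^T A x with A symmetric positive definite,
# f is mu-strongly convex and has an L-Lipschitz gradient with
# mu = lambda_min(A) and L = lambda_max(A).  We verify the two-sided
# inequality from the text at random pairs (x, y).
rng = np.random.default_rng(0)
n = 5
M = rng.standard_normal((n, n))
A = M @ M.T + np.eye(n)           # symmetric positive definite
mu = np.linalg.eigvalsh(A).min()  # strong-convexity parameter
L = np.linalg.eigvalsh(A).max()   # Lipschitz constant of the gradient

f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x

for _ in range(1000):
    x, y = rng.standard_normal(n), rng.standard_normal(n)
    lin = f(x) + grad(x) @ (y - x)  # first-order approximation at x
    d2 = np.dot(y - x, y - x)       # ||y - x||_2^2
    lower = lin + 0.5 * mu * d2     # strong-convexity lower bound
    upper = lin + 0.5 * L * d2      # smoothness upper bound
    assert lower - 1e-9 <= f(y) <= upper + 1e-9
```

For a quadratic both bounds follow from μ‖d‖² ≤ dᵀAd ≤ L‖d‖², so the check passes exactly up to floating-point tolerance.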

* E-mail: [email protected]
** E-mail: [email protected]
*** E-mail: [email protected]
**** E-mail: [email protected]
^1 Note that, for minimization problems for functionals of sum type, the boundedness condition on the (sub-Gaussian) variance may be violated even in very simple (quadratic) situations. As a consequence, in the general case, the results given in this paper cannot be extended to minimization problems for functionals of sum type in which the gradient of a randomly chosen summand [4] is used as the stochastic gradient.
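A stochastic gradient oracle satisfying conditions of the form (1.2) is easy to simulate. The sketch below (the specific objective f(x) = ½‖x‖₂², the Gaussian noise model, and the value of σ² are assumptions for illustration) checks unbiasedness and the sub-Gaussian moment bound empirically. For Gaussian noise ξ ~ N(0, s²Iₙ) one has E[exp(‖ξ‖₂²/σ²)] = (1 − 2s²/σ²)^{−n/2}, so σ² = 8 with s = 1, n = 2 gives 4/3 ≤ e.

```python
import numpy as np

# Hypothetical stochastic gradient oracle for f(x) = 0.5 * ||x||_2^2:
# grad f(x, xi) = x + noise with noise ~ N(0, s^2 I_n).  The oracle is
# unbiased, and sigma^2 = 8 makes the sub-Gaussian condition (1.2) hold:
# E[exp(||noise||^2 / sigma^2)] = (1 - 2 s^2 / sigma^2)^(-n/2) = 4/3 <= e.
rng = np.random.default_rng(1)
n, s = 2, 1.0
sigma2 = 8.0  # assumed sub-Gaussian variance parameter

def stoch_grad(x):
    return x + s * rng.standard_normal(n)

x = rng.standard_normal(n)
samples = np.array([stoch_grad(x) for _ in range(200_000)])

# Unbiasedness: the empirical mean is close to the true gradient x.
assert np.allclose(samples.mean(axis=0), x, atol=0.02)

# Sub-Gaussian condition (1.2): empirical E[exp(||noise||^2 / sigma2)] <= e.
noise_sq = ((samples - x) ** 2).sum(axis=1)
assert np.exp(noise_sq / sigma2).mean() <= np.e
```

Choosing σ² large enough relative to the noise scale s is what keeps the exponential moment finite; at σ² = 2s²n the moment diverges, which is why the check uses a comfortable margin.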


Then, after N calculations of \nabla f(x, \xi), with large probability we have^2 [1]–[3]

    f(x_N) - f(x_*) = O\left( \min\left\{ \frac{L R^2}{N^p},\; L R^2 \exp\left( -\left(\frac{\mu}{L}\right)^{1/p} \frac{N}{2} \right) \right\} + \min\left\{ \frac{\sigma R}{\sqrt{N}},\; \frac{\sigma^2}{\mu N} \right\} \right),    (1.3)

where x_* is a solution of problem (1.1), R = \|x_0 - x_*\|_2, x_0 is the starting point, p = 1 corresponds to stochastic gradient descent, and p = 2 to accelerated stochastic gradient descent. On the other hand, it is known (see [1], [2], [5]–[7]) that if, for problem (