Ritz-like values in steplength selections for stochastic gradient methods
Giorgia Franchini¹ · Valeria Ruggiero² · Luca Zanni¹

¹ Department of Physics, Informatics and Mathematics, University of Modena and Reggio Emilia, Modena, Italy
² Department of Mathematics and Computer Science, University of Ferrara, Ferrara, Italy

© Springer-Verlag GmbH Germany, part of Springer Nature 2020
Communicated by Yaroslav D. Sergeyev.
Abstract

The steplength selection is a crucial issue for the effectiveness of stochastic gradient methods in large-scale optimization problems arising in machine learning. In a recent paper, Bollapragada et al. (SIAM J Optim 28(4):3312–3343, 2018) propose to include an adaptive subsampling strategy in a stochastic gradient scheme, with the aim of ensuring that the stochastic gradient directions are descent directions in expectation. In this approach, theoretical convergence properties are preserved under the assumption that, at any iteration, the positive steplength satisfies a suitable bound depending on the inverse of the Lipschitz constant of the objective function gradient. In this paper, we propose to tailor to the stochastic gradient scheme the steplength selection adopted in the full-gradient method known as the limited memory steepest descent (LMSD) method. This strategy, based on the Ritz-like values of a suitable matrix, provides a local estimate of the inverse of the local Lipschitz parameter without introducing line search techniques, while a possible increase in the size of the subsample used to compute the stochastic gradient controls the variance of this direction. An extensive numerical experimentation highlights that the new rule makes the tuning of the parameters less expensive than the trial-and-error procedure needed to select an efficient constant steplength in standard and mini-batch stochastic gradient methods.

Keywords Stochastic gradient methods · Learning rate selection rule · Ritz-like values · Adaptive subsampling strategies · Variance reduction techniques
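Although the algorithmic details appear later in the paper, the Ritz-value idea behind Fletcher's LMSD method, which the abstract builds on, can be sketched for the deterministic quadratic case F(x) = ½xᵀAx − bᵀx: the last m gradients span a subspace, the eigenvalues (Ritz values) of the Hessian projected onto that subspace estimate m eigenvalues of A, and their inverses supply the next m steplengths. The sketch below is illustrative only; it assumes explicit access to a symmetric positive definite Hessian `A` for clarity, whereas practical LMSD recovers the projected matrix from the stored gradients alone, and the function name `ritz_steplengths` is not from the paper.

```python
import numpy as np

def ritz_steplengths(A, grads):
    """Sketch of LMSD steplengths for a quadratic with SPD Hessian A.

    grads: list of the last m gradient vectors g_{k-m+1}, ..., g_k.
    """
    # Orthonormalize the gradient matrix G via a thin QR factorization.
    G = np.column_stack(grads)
    Q, _ = np.linalg.qr(G)
    # Projected Hessian T = Q^T A Q; its eigenvalues are the Ritz values,
    # i.e. estimates of m eigenvalues of A.
    T = Q.T @ A @ Q
    theta = np.linalg.eigvalsh((T + T.T) / 2)  # symmetrize for safety
    # LMSD takes the inverse Ritz values as the next m steplengths.
    return 1.0 / theta
```

The m steplengths returned are then spent over the next m gradient iterations (Fletcher's implementation uses them in order of decreasing Ritz value) before a new group of gradients is collected.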
1 Introduction

The problem we consider is the unconstrained minimization of the form

$$\min_{x \in \mathbb{R}^d} F(x) \equiv \mathbb{E}[f(x, \xi)], \qquad (1)$$
where ξ is a multi-valued random variable, f represents a cost function, and the mathematical expectation E is defined with respect to ξ in the probability space (Ξ, F, P). It is assumed that the function f : ℝ^d × Ξ → ℝ is either known analytically or provided by a black-box oracle within a prefixed accuracy. In practice, since the probability distribution of ξ is unknown, we seek the solution of a problem that involves an estimate of the objective function F(x). The most common approximation is the Sample Average Approximation, defined as

$$\min_{x \in \mathbb{R}^d} F_n(x) \equiv F_n(x, \xi^{(n)}), \qquad (2)$$

where the objective function is the empirical risk
$$F_n(x, \xi^{(n)}) = \frac{1}{n} \sum_{i=1}^{n} f(x, \xi_i^{(n)}) = \frac{1}{n} \sum_{i=1}^{n} f_i(x), \qquad (3)$$

based on a random sample ξ^(n) = {ξ_1^(n), …, ξ_n^(n)} of size n.
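To make the sample average approximation concrete, here is a minimal sketch of the empirical risk (3) and of a mini-batch stochastic gradient estimate of ∇F_n, assuming a least-squares loss f_i(x) = ½(a_iᵀx − b_i)²; the loss choice, the data, and all names are illustrative assumptions, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative data: n samples (a_i, b_i) with least-squares loss
# f_i(x) = 0.5 * (a_i^T x - b_i)^2 (an assumption for this sketch).
n, d = 1000, 20
A = rng.standard_normal((n, d))
b = rng.standard_normal(n)

def empirical_risk(x):
    # F_n(x) = (1/n) * sum_i f_i(x), cf. Eq. (3)
    r = A @ x - b
    return 0.5 * np.mean(r ** 2)

def minibatch_gradient(x, batch_size):
    # Unbiased estimate of grad F_n(x) from a random subsample
    idx = rng.choice(n, size=batch_size, replace=False)
    r = A[idx] @ x - b[idx]
    return A[idx].T @ r / batch_size

# One stochastic gradient step with a fixed steplength alpha
x = np.zeros(d)
alpha = 0.01
x -= alpha * minibatch_gradient(x, batch_size=32)
print(empirical_risk(x))
```

Increasing `batch_size` reduces the variance of the gradient estimate at a proportional increase in per-iteration cost, which is the trade-off the adaptive subsampling strategies discussed in this paper are designed to manage.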