Ritz-like values in steplength selections for stochastic gradient methods
Giorgia Franchini¹ · Valeria Ruggiero² · Luca Zanni¹

¹ Department of Physics, Informatics and Mathematics, University of Modena and Reggio Emilia, Modena, Italy
² Department of Mathematics and Computer Science, University of Ferrara, Ferrara, Italy

© Springer-Verlag GmbH Germany, part of Springer Nature 2020
Communicated by Yaroslav D. Sergeyev.
Abstract

The steplength selection is a crucial issue for the effectiveness of stochastic gradient methods in large-scale optimization problems arising in machine learning. In a recent paper, Bollapragada et al. (SIAM J Optim 28(4):3312–3343, 2018) propose to include an adaptive subsampling strategy in a stochastic gradient scheme, with the aim of ensuring that the stochastic gradient directions are descent directions in expectation. In this approach, theoretical convergence properties are preserved under the assumption that, at any iteration, the positive steplength satisfies a suitable bound depending on the inverse of the Lipschitz constant of the objective function gradient. In this paper, we propose to tailor to the stochastic gradient scheme the steplength selection adopted in the full-gradient method known as the limited memory steepest descent (LMSD) method. This strategy, based on the Ritz-like values of a suitable matrix, provides a local estimate of the inverse of the local Lipschitz parameter without introducing line search techniques, while a possible increase in the size of the subsample used to compute the stochastic gradient controls the variance of this direction. An extensive numerical experimentation highlights that the new rule makes the tuning of the parameters less expensive than the trial-and-error procedure needed to select an efficient constant steplength in standard and mini-batch stochastic gradient methods.

Keywords Stochastic gradient methods · Learning rate selection rule · Ritz-like values · Adaptive subsampling strategies · Variance reduction techniques
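Although the algorithmic details appear later in the paper, the Ritz-value idea behind Fletcher's LMSD method, which the abstract builds on, can be sketched for the deterministic quadratic case F(x) = ½xᵀAx − bᵀx: the last m gradients span a subspace, the eigenvalues (Ritz values) of the Hessian projected onto that subspace estimate m eigenvalues of A, and their inverses supply the next m steplengths. The sketch below is illustrative only; it assumes explicit access to a symmetric positive definite Hessian `A` for clarity, whereas practical LMSD recovers the projected matrix from the stored gradients alone, and the function name `ritz_steplengths` is not from the paper.

```python
import numpy as np

def ritz_steplengths(A, grads):
    """Sketch of LMSD steplengths for a quadratic with SPD Hessian A.

    grads: list of the last m gradient vectors g_{k-m+1}, ..., g_k.
    """
    # Orthonormalize the gradient matrix G via a thin QR factorization.
    G = np.column_stack(grads)
    Q, _ = np.linalg.qr(G)
    # Projected Hessian T = Q^T A Q; its eigenvalues are the Ritz values,
    # i.e. estimates of m eigenvalues of A.
    T = Q.T @ A @ Q
    theta = np.linalg.eigvalsh((T + T.T) / 2)  # symmetrize for safety
    # LMSD takes the inverse Ritz values as the next m steplengths.
    return 1.0 / theta
```

The m steplengths returned are then spent over the next m gradient iterations (Fletcher's implementation uses them in order of decreasing Ritz value) before a new group of gradients is collected.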
1 Introduction

The problem we consider is the unconstrained minimization of the form

$$\min_{x \in \mathbb{R}^d} F(x) \equiv \mathbb{E}[f(x, \xi)], \qquad (1)$$
where ξ is a multi-valued random variable, f represents a cost function, and the mathematical expectation E is defined with respect to ξ in the probability space (Ξ, F, P). It is assumed that the function f : ℝ^d × Ξ → ℝ is either known analytically or provided by a black-box oracle within a prefixed accuracy. In practice, since the probability distribution of ξ is unknown, we seek the solution of a problem that involves an estimate of the objective function F(x). The most common approximation is the Sample Average Approximation, defined as

$$\min_{x \in \mathbb{R}^d} F_n(x) \equiv F_n(x, \xi^{(n)}), \qquad (2)$$

where the objective function is the empirical risk
$$F_n(x, \xi^{(n)}) = \frac{1}{n} \sum_{i=1}^{n} f(x, \xi_i^{(n)}) = \frac{1}{n} \sum_{i=1}^{n} f_i(x), \qquad (3)$$

based on a random sample ξ^(n) = {ξ_1^(n), …, ξ_n^(n)} of size n.
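To make the sample average approximation concrete, here is a minimal sketch of the empirical risk (3) and of a mini-batch stochastic gradient estimate of ∇F_n, assuming a least-squares loss f_i(x) = ½(a_iᵀx − b_i)²; the loss choice, the data, and all names are illustrative assumptions, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative data: n samples (a_i, b_i) with least-squares loss
# f_i(x) = 0.5 * (a_i^T x - b_i)^2 (an assumption for this sketch).
n, d = 1000, 20
A = rng.standard_normal((n, d))
b = rng.standard_normal(n)

def empirical_risk(x):
    # F_n(x) = (1/n) * sum_i f_i(x), cf. Eq. (3)
    r = A @ x - b
    return 0.5 * np.mean(r ** 2)

def minibatch_gradient(x, batch_size):
    # Unbiased estimate of grad F_n(x) from a random subsample
    idx = rng.choice(n, size=batch_size, replace=False)
    r = A[idx] @ x - b[idx]
    return A[idx].T @ r / batch_size

# One stochastic gradient step with a fixed steplength alpha
x = np.zeros(d)
alpha = 0.01
x -= alpha * minibatch_gradient(x, batch_size=32)
print(empirical_risk(x))
```

Increasing `batch_size` reduces the variance of the gradient estimate at a proportional increase in per-iteration cost, which is the trade-off the adaptive subsampling strategies discussed in this paper are designed to manage.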