A linearly convergent stochastic recursive gradient method for convex optimization
Yan Liu^{1,2} · Xiao Wang^{1,2} · Tiande Guo^{1,2}

Received: 4 April 2019 / Accepted: 8 February 2020
© Springer-Verlag GmbH Germany, part of Springer Nature 2020

1 School of Mathematical Sciences, University of Chinese Academy of Sciences, Beijing 100049, China
2 Key Laboratory of Big Data Mining and Knowledge Management, University of Chinese Academy of Sciences, Beijing 100049, China
Abstract  The stochastic recursive gradient algorithm (SARAH) (Nguyen et al., in: Neural Information Processing Systems, pp 2613–2621, 2017) has attracted much interest recently. It admits a simple recursive framework for updating stochastic gradient estimates. Motivated by this, in this paper we propose a new stochastic recursive gradient method, called SARAH-I. Different from SARAH, SARAH-I incorporates an importance sampling strategy and computes the full gradient at the last iterate in each inner loop. We show that the sequence of distances between the iterates and the optimal set converges linearly under both strong convexity and non-strong convexity conditions. Furthermore, we propose to use the Barzilai–Borwein (BB) approach to adaptively compute step sizes for SARAH-I, and name the resulting method SARAH-I-BB. We establish its convergence and complexity properties in different cases. Finally, numerical tests are reported, indicating the promising performance of the proposed algorithms.

Keywords  Stochastic optimization · Stochastic gradient · BB method · Linear convergence rate · Complexity
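As a brief reminder of the cited framework (this display is added for readability and is not part of the original abstract; the notation is illustrative), SARAH maintains a recursive stochastic gradient estimate of the form

$$v_t = \nabla f_{i_t}(w_t) - \nabla f_{i_t}(w_{t-1}) + v_{t-1}, \qquad w_{t+1} = w_t - \eta\, v_t,$$

where $i_t$ is a randomly sampled index, $\eta > 0$ is a step size, and each inner loop is initialized with an exact full gradient $v_0$ and the step $w_1 = w_0 - \eta v_0$.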
1 Introduction

In the context of large-scale machine learning, the following type of optimization problem is widely considered:
$$\min_{w \in \mathbb{R}^d} \; P(w) = \frac{1}{n} \sum_{i=1}^{n} f_i(w), \qquad (1)$$
where $n$ is the sample size and each $f_i$, $i \in \{1, 2, \ldots, n\}$, is continuously differentiable. Problem (1) is challenging when $n$ is extremely large: the exact full gradient is expensive to compute, so exact gradient-based methods become impractical or prohibitively costly. The stochastic gradient descent (SGD) method, which traces back to the seminal work of [1], has instead become the dominant approach for solving (1). In the $t$-th iteration, SGD picks an index $i \in [n]$ at random and updates the iterate $w_t$ by $w_{t+1} = w_t - \eta_t \nabla f_i(w_t)$, where $\eta_t > 0$ is the step size and $\nabla f_i(w_t)$ denotes the sample gradient at $w_t$.

A surge of methods has been developed to improve the performance of SGD. One of the most prevalent types is the class of gradient aggregation algorithms [3], such as SAG [4,5] and SAGA [6]. They compute a stochastic gradient as an average of stochastic gradients evaluated at previous iterates, which requires storing those previous stochastic gradients at the expense of memory. SVRG [7] has two loops, with a full gradient computed in the outer loop (each outer iteration is called an epoch) and stochastic gradients with lower variance computed in the inner loop. S2GD [14] runs a random number of stochastic gradient steps in each inner loop.
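To make the setting concrete, the following is a minimal NumPy sketch (not from the paper) of problem (1) with an illustrative least-squares loss, together with the plain SGD update stated above and an SVRG-style epoch as described in this paragraph; the loss, step sizes, and loop lengths are assumptions chosen purely for illustration.

```python
import numpy as np

# Illustrative instance of problem (1): f_i(w) = 0.5 * (a_i^T w - b_i)^2,
# so that P(w) = (1/(2n)) * ||A w - b||^2.  All parameter choices below are
# assumptions made only so that the sketch runs end to end.

def grad_fi(w, A, b, i):
    """Gradient of the i-th component function f_i at w."""
    a_i = A[i]
    return (a_i @ w - b[i]) * a_i

def full_grad(w, A, b):
    """Exact gradient of P(w) = (1/n) * sum_i f_i(w)."""
    return A.T @ (A @ w - b) / A.shape[0]

def sgd(w0, A, b, eta, T, rng):
    """Plain SGD: w_{t+1} = w_t - eta * grad f_i(w_t) for a random index i."""
    n, w = A.shape[0], w0.copy()
    for _ in range(T):
        i = rng.integers(n)
        w -= eta * grad_fi(w, A, b, i)
    return w

def svrg_epoch(w_tilde, A, b, eta, m, rng):
    """One SVRG epoch: a full gradient at the snapshot w_tilde, then m inner
    steps with the variance-reduced estimate
        g_t = grad f_i(w_t) - grad f_i(w_tilde) + grad P(w_tilde)."""
    n = A.shape[0]
    mu = full_grad(w_tilde, A, b)   # full gradient, computed once per epoch
    w = w_tilde.copy()
    for _ in range(m):
        i = rng.integers(n)
        g = grad_fi(w, A, b, i) - grad_fi(w_tilde, A, b, i) + mu
        w -= eta * g
    return w

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d = 500, 20
    A = rng.standard_normal((n, d))
    w_star = rng.standard_normal(d)
    b = A @ w_star                               # consistent system: optimum is w_star
    eta = 0.1 / np.max(np.sum(A**2, axis=1))     # conservative step: 0.1 / max_i L_i

    w_sgd = sgd(np.zeros(d), A, b, eta=eta, T=10 * n, rng=rng)
    print("SGD  distance to optimum:", np.linalg.norm(w_sgd - w_star))

    w = np.zeros(d)
    for _ in range(10):                          # 10 outer iterations (epochs)
        w = svrg_epoch(w, A, b, eta=eta, m=n, rng=rng)
    print("SVRG distance to optimum:", np.linalg.norm(w - w_star))
```

Gradient aggregation methods such as SAG and SAGA would instead maintain a table of the most recent stochastic gradient for each of the $n$ components, trading SVRG's per-epoch full gradient for additional storage, which is the memory cost mentioned above.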