Large Scale Optimization with Proximal Stochastic Newton-Type Gradient Descent

Abstract. In this work, we generalize and unify two recent, quite different works, of Jascha [10] and Lee [2] respectively, by proposing the proximal stochastic Newton-type gradient (PROXTONE) method for optimizing the sum of two convex functions: one is the average of a huge number of smooth convex functions, and the other is a nonsmooth convex function. PROXTONE incorporates second-order information to obtain stronger convergence results: it achieves a linear convergence rate not only in the value of the objective function, but also for the solution. The proofs are simple and intuitive, and the results and techniques can serve as a starting point for research on proximal stochastic methods that employ second-order information. The methods and principles proposed in this paper can be used for logistic regression, the training of deep neural networks, and so on. Our numerical experiments show that PROXTONE achieves better computational performance than existing methods.

1 Introduction and Problem Statement

In this work, we consider problems of the following form:

$$\min_{x \in \mathbb{R}^p} f(x) := \frac{1}{n} \sum_{i=1}^{n} g_i(x) + h(x), \qquad (1)$$

where each $g_i$ is a smooth convex loss function associated with a sample in a training set, and $h$ is a non-smooth convex penalty function or regularizer. Let $g(x) = \frac{1}{n} \sum_{i=1}^{n} g_i(x)$. We assume the optimal value $f^*$ is attained at some optimal solution $x^*$, not necessarily unique. Problems of this form often arise in machine learning, for example in least-squares regression, the Lasso, the elastic net, logistic regression, and deep neural networks. For optimizing (1), the standard and popular proximal full gradient method (ProxFG) uses iterations of the form

$$x_{k+1} = \arg\min_{x \in \mathbb{R}^p} \left\{ \nabla g(x_k)^T x + \frac{1}{2\alpha_k} \|x - x_k\|^2 + h(x) \right\}, \qquad (2)$$

where $\alpha_k$ is the step size at the $k$-th iteration.
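To make the update (2) concrete, the following is a minimal sketch of one ProxFG iteration for an $\ell_1$-regularized least-squares instance of (1), where the proximal step has a closed-form soft-thresholding solution; the quadratic loss, the regularization weight lam, and the constant step size alpha are illustrative assumptions rather than choices made in this paper.

import numpy as np

def prox_l1(z, t):
    # Proximal operator of t * ||.||_1, i.e. soft thresholding.
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def proxfg_step(x, A, b, lam, alpha):
    # One ProxFG iteration (2) for g(x) = (1/2n) ||Ax - b||^2 and
    # h(x) = lam * ||x||_1. Forming the full gradient touches all n
    # samples, which is the O(n) per-iteration cost discussed below.
    n = A.shape[0]
    grad = A.T @ (A @ x - b) / n   # full gradient of g at x
    return prox_l1(x - alpha * grad, alpha * lam)

Because the quadratic model in (2) separates from $h$, its minimizer is exactly the proximal step $\mathrm{prox}_{\alpha_k h}(x_k - \alpha_k \nabla g(x_k))$; the sketch exploits this equivalence.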


Under standard assumptions, the sub-optimality achieved at iteration $k$ of the ProxFG method with a constant step size is given by

$$f(x_k) - f(x^*) = O\left(\frac{1}{k}\right).$$

When $f$ is strongly convex, the error satisfies [11]

$$f(x_k) - f(x^*) = O\left(\left(\frac{L - \mu_g}{L + \mu_h}\right)^k\right),$$

where $L$ is the Lipschitz constant of $f(x)$, and $\mu_g$ and $\mu_h$ are the convexity parameters of $g(x)$ and $h(x)$, respectively. These notations will be detailed in Section 1.1. This results in a linear convergence rate, also known as a geometric or exponential rate, because the error is cut by a fixed fraction on each iteration. Unfortunately, the ProxFG method can be unappealing when $n$ is large or huge, because its iteration cost scales linearly in $n$: each iteration of (2) requires computing the gradients of all $n$ component functions $g_i$, as well as their average. To overcome this problem, researchers proposed proximal stochastic gradient descent methods (ProxSGD), wh
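A minimal sketch of the ProxSGD idea, under the same illustrative $\ell_1$-regularized least-squares assumptions as above: each iteration samples a single component $g_i$ and substitutes $\nabla g_i(x_k)$ for the full gradient in (2), so the per-iteration cost no longer depends on $n$; the uniform sampling and the per-iteration step size $\alpha_k$ used here are common choices, not details taken from this paper.

import numpy as np

def proxsgd_step(x, A, b, lam, alpha_k, rng):
    # One ProxSGD iteration: the gradient of a single randomly sampled
    # component g_i stands in for the full gradient, followed by the
    # same soft-thresholding proximal step as in ProxFG above.
    i = rng.integers(A.shape[0])        # sample one component uniformly
    grad_i = (A[i] @ x - b[i]) * A[i]   # gradient of g_i at x
    z = x - alpha_k * grad_i
    return np.sign(z) * np.maximum(np.abs(z) - alpha_k * lam, 0.0)

Called with, e.g., rng = np.random.default_rng(0), this reduces the per-iteration cost from $O(np)$ to $O(p)$, at the price of noisy gradient estimates; overcoming that noise while keeping iterations cheap is the motivation for methods such as PROXTONE.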