Large Scale Optimization with Proximal Stochastic Newton-Type Gradient Descent

Abstract. In this work, we generalize and unify two recent, quite different works, of Jascha [10] and Lee [2] respectively, by proposing the proximal stochastic Newton-type gradient (PROXTONE) method for optimizing the sum of two convex functions: one is the average of a huge number of smooth convex functions, and the other is a nonsmooth convex function. PROXTONE incorporates second-order information to obtain stronger convergence results: it achieves a linear convergence rate not only in the value of the objective function, but also for the solution. The proofs are simple and intuitive, and the results and techniques can serve as a starting point for research on proximal stochastic methods that employ second-order information. The methods and principles proposed in this paper can be used for logistic regression, the training of deep neural networks, and so on. Our numerical experiments show that PROXTONE achieves better computational performance than existing methods.

1 Introduction and Problem Statement

In this work, we consider problems of the following form:

$$\min_{x \in \mathbb{R}^p} f(x) := \frac{1}{n} \sum_{i=1}^{n} g_i(x) + h(x), \qquad (1)$$

where each $g_i$ is a smooth convex loss function associated with a sample in a training set, and $h$ is a non-smooth convex penalty function or regularizer. Let $g(x) = \frac{1}{n} \sum_{i=1}^{n} g_i(x)$. We assume the optimal value $f^*$ is attained at some optimal solution $x^*$, not necessarily unique. Problems of this form often arise in machine learning, for example in least-squares regression, the Lasso, the elastic net, logistic regression, and deep neural networks. For optimizing (1), the standard and popular proximal full gradient method (ProxFG) uses iterations of the form

$$x_{k+1} = \arg\min_{x \in \mathbb{R}^p} \left\{ \nabla g(x_k)^T x + \frac{1}{2\alpha_k} \|x - x_k\|^2 + h(x) \right\}, \qquad (2)$$

where $\alpha_k$ is the step size at the $k$-th iteration.
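To make the update (2) concrete, the following is a minimal sketch of one ProxFG iteration for an $\ell_1$-regularized least-squares instance of (1), where the proximal step has a closed-form soft-thresholding solution; the quadratic loss, the regularization weight lam, and the constant step size alpha are illustrative assumptions rather than choices made in this paper.

import numpy as np

def prox_l1(z, t):
    # Proximal operator of t * ||.||_1, i.e. soft thresholding.
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def proxfg_step(x, A, b, lam, alpha):
    # One ProxFG iteration (2) for g(x) = (1/2n) ||Ax - b||^2 and
    # h(x) = lam * ||x||_1. Forming the full gradient touches all n
    # samples, which is the O(n) per-iteration cost discussed below.
    n = A.shape[0]
    grad = A.T @ (A @ x - b) / n   # full gradient of g at x
    return prox_l1(x - alpha * grad, alpha * lam)

Because the quadratic model in (2) separates from $h$, its minimizer is exactly the proximal step $\mathrm{prox}_{\alpha_k h}(x_k - \alpha_k \nabla g(x_k))$; the sketch exploits this equivalence.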


Under standard assumptions, the sub-optimality achieved at iteration $k$ of the ProxFG method with a constant step size is given by

$$f(x_k) - f(x^*) = O\left(\frac{1}{k}\right).$$

When $f$ is strongly convex, the error satisfies [11]

$$f(x_k) - f(x^*) = O\left(\left(\frac{L - \mu_g}{L + \mu_h}\right)^k\right),$$

where $L$ is the Lipschitz constant of $f(x)$, and $\mu_g$ and $\mu_h$ are the convexity parameters of $g(x)$ and $h(x)$, respectively. These notations will be detailed in Section 1.1. This results in a linear convergence rate, also known as a geometric or exponential rate, because the error is cut by a fixed fraction on each iteration. Unfortunately, the ProxFG method can be unappealing when $n$ is large or huge, because its iteration cost scales linearly in $n$: each iteration of (2) requires computing the gradients of all $n$ component functions $g_i$, as well as their average. To overcome this problem, researchers proposed proximal stochastic gradient descent methods (ProxSGD), wh
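A minimal sketch of the ProxSGD idea, under the same illustrative $\ell_1$-regularized least-squares assumptions as above: each iteration samples a single component $g_i$ and substitutes $\nabla g_i(x_k)$ for the full gradient in (2), so the per-iteration cost no longer depends on $n$; the uniform sampling and the per-iteration step size $\alpha_k$ used here are common choices, not details taken from this paper.

import numpy as np

def proxsgd_step(x, A, b, lam, alpha_k, rng):
    # One ProxSGD iteration: the gradient of a single randomly sampled
    # component g_i stands in for the full gradient, followed by the
    # same soft-thresholding proximal step as in ProxFG above.
    i = rng.integers(A.shape[0])        # sample one component uniformly
    grad_i = (A[i] @ x - b[i]) * A[i]   # gradient of g_i at x
    z = x - alpha_k * grad_i
    return np.sign(z) * np.maximum(np.abs(z) - alpha_k * lam, 0.0)

Called with, e.g., rng = np.random.default_rng(0), this reduces the per-iteration cost from $O(np)$ to $O(p)$, at the price of noisy gradient estimates; overcoming that noise while keeping iterations cheap is the motivation for methods such as PROXTONE.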