Adaptive Stochastic Primal-Dual Coordinate Descent for Separable Saddle Point Problems

Abstract. We consider a generic convex-concave saddle point problem with a separable structure, a form that covers a wide range of machine learning applications. Under this problem structure, we follow the framework of primal-dual updates for saddle point problems, and incorporate stochastic block coordinate descent with adaptive stepsizes into this framework. We show theoretically that the proposed adaptive stepsizes can achieve a sharper linear convergence rate than existing methods. Additionally, since we can select a "mini-batch" of block coordinates to update, our method is also amenable to parallel processing for large-scale data. We apply the proposed method to regularized empirical risk minimization and show that it performs comparably or, more often, better than state-of-the-art methods on both synthetic and real-world data sets.

Keywords: Large-scale optimization · Parallel optimization · Stochastic coordinate descent · Convex-concave saddle point problems

1 Introduction

The generic convex-concave saddle point problem is written as

$$\min_{x \in \mathbb{R}^d} \max_{y \in \mathbb{R}^q} \left\{ L(x, y) = g(x) + \langle x, Ky \rangle - \phi^*(y) \right\}, \qquad (1)$$

where $g(x)$ is a proper convex function, $\phi^*(\cdot)$ is the convex conjugate of a convex function $\phi(\cdot)$, and $K \in \mathbb{R}^{d \times q}$. Many machine learning tasks reduce to solving a problem of this form [3,6]. As a result, this saddle point problem has been widely studied [1,2,4,5,14,16]. One important subclass of the general convex-concave saddle point problem is the one where $g(x)$ or $\phi^*(y)$ exhibits an additive separable structure. We say $\phi^*(y)$ is separable when $\phi^*(y) = \frac{1}{n}\sum_{i=1}^{n} \phi_i^*(y_i)$, with $y_i \in \mathbb{R}^{q_i}$ and $\sum_{i=1}^{n} q_i = q$. Separability of $g(\cdot)$ is defined likewise.
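To make the role of the conjugate $\phi^*(\cdot)$ in (1) concrete, here is a minimal numerical sketch (our own illustration, not from the paper) for the quadratic case $\phi(z) = \frac{1}{2}\|z\|_2^2$, whose conjugate is $\phi^*(y) = \frac{1}{2}\|y\|_2^2$: the inner maximization over $y$ then recovers $\phi(K^T x)$, so the saddle point problem encodes $\min_x g(x) + \phi(K^T x)$. The dimensions and random data below are arbitrary.

```python
import numpy as np

# Illustrative sketch (assumed quadratic phi, random data; not from the paper).
# For phi(z) = 0.5*||z||^2 we have phi*(y) = 0.5*||y||^2, and the maximizer of
# <x, K y> - phi*(y) over y is y* = K^T x, so the inner maximum equals phi(K^T x).
rng = np.random.default_rng(0)
d, q = 5, 3
K = rng.normal(size=(d, q))
x = rng.normal(size=d)

y_star = K.T @ x                                    # closed-form inner maximizer
inner_max = x @ (K @ y_star) - 0.5 * y_star @ y_star
phi_of_KTx = 0.5 * (K.T @ x) @ (K.T @ x)
print(np.isclose(inner_max, phi_of_KTx))            # True
```

In other words, the maximization over $y$ simply re-expresses $\phi(K^T x)$ through its conjugate.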


To keep the notation consistent with the machine learning applications discussed later, we introduce a matrix $A$ and let $K = \frac{1}{n}A$. We then partition $A$ into $n$ column blocks $A_i \in \mathbb{R}^{d \times q_i}$, $i = 1, \ldots, n$, so that $Ky = \frac{1}{n}\sum_{i=1}^{n} A_i y_i$, resulting in a problem of the form

$$\min_{x \in \mathbb{R}^d} \max_{y \in \mathbb{R}^q} \left\{ L(x, y) = g(x) + \frac{1}{n}\sum_{i=1}^{n} \left( \langle x, A_i y_i \rangle - \phi_i^*(y_i) \right) \right\} \qquad (2)$$

for $\phi^*(\cdot)$ separable. We call any problem of the form (1) in which $g(\cdot)$ or $\phi^*(\cdot)$ has separable structure a Separable Convex-Concave Saddle Point (Sep-CCSP) problem; Eq. (2) gives the explicit form when $\phi^*(\cdot)$ is separable. In this work, we further assume that each $\phi_i^*(y_i)$ is $\gamma$-strongly convex and that $g(x)$ is $\lambda$-strongly convex, i.e.,

$$\phi_i^*(y_i') \geq \phi_i^*(y_i) + \nabla \phi_i^*(y_i)^T (y_i' - y_i) + \frac{\gamma}{2}\|y_i' - y_i\|_2^2, \qquad \forall y_i, y_i' \in \mathbb{R}^{q_i},$$
$$g(x') \geq g(x) + \nabla g(x)^T (x' - x) + \frac{\lambda}{2}\|x' - x\|_2^2, \qquad \forall x, x' \in \mathbb{R}^d,$$

where we use $\nabla$ to denote both the gradient of a smooth function and a subgradient of a non-smooth function. When strong convexity does not hold, a small strongly convex perturbation can be added to make the problem satisfy the assumption [15]. One important instantiation of the Sep-CCSP problem in machine learning is regularized empirical risk minimization (ERM).
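As a concrete instance of the strong convexity assumption (our own worked example, not taken from the paper), consider the squared loss $\phi_i(z) = \frac{1}{2}(z - b_i)^2$; its conjugate is $\phi_i^*(y_i) = \frac{1}{2}y_i^2 + b_i y_i$, which is $\gamma$-strongly convex with $\gamma = 1$. The short Python check below verifies the first inequality at random points.

```python
import numpy as np

# Hedged illustration: phi_i(z) = 0.5*(z - b_i)^2 has conjugate
# phi_i*(y) = 0.5*y^2 + b_i*y (gamma = 1). Check the strong convexity bound
#   phi_i*(y') >= phi_i*(y) + d/dy phi_i*(y) * (y' - y) + (gamma/2)*(y' - y)^2
rng = np.random.default_rng(1)
b_i, gamma = 0.7, 1.0

conj = lambda y: 0.5 * y**2 + b_i * y          # phi_i*(y)
conj_grad = lambda y: y + b_i                  # derivative of phi_i*

y, y_prime = rng.normal(size=2)
lhs = conj(y_prime)
rhs = conj(y) + conj_grad(y) * (y_prime - y) + 0.5 * gamma * (y_prime - y) ** 2
print(lhs >= rhs - 1e-12)   # True; for this quadratic the bound holds with equality
```

Non-smooth losses such as the hinge loss have conjugates that are not strongly convex, which is where the perturbation mentioned above becomes relevant.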