Alternating maximization: unifying framework for 8 sparse PCA formulations and efficient parallel codes

Peter Richtárik¹ · Majid Jahani² · Selin Damla Ahipaşaoğlu³ · Martin Takáč²

Received: 30 October 2019 / Revised: 29 April 2020 / Accepted: 7 September 2020
© Springer Science+Business Media, LLC, part of Springer Nature 2020

MT was partially supported by National Science Foundation Grants CCF-1618717, CMMI-1663256 and CCF-1740796.
* Martin Takáč, [email protected]. Extended author information is available on the last page of the article.

Abstract

Given a multivariate data set, sparse principal component analysis (SPCA) aims to extract several linear combinations of the variables that together explain as much of the variance in the data as possible, while controlling the number of nonzero loadings in these combinations. In this paper we consider 8 different optimization formulations for computing a single sparse loading vector: we employ two norms for measuring variance (L2, L1) and two sparsity-inducing norms (L0, L1), which are used in two ways (constraint, penalty). Three of our formulations, notably the one with the L0 constraint and L1 variance, have not been considered in the literature. We give a unifying reformulation which we propose to solve via the alternating maximization (AM) method. We show that AM is equivalent to GPower for all formulations. Besides this, we provide 24 efficient parallel SPCA implementations: 3 codes (multi-core, GPU and cluster) for each of the 8 problems. Parallelism in the methods is aimed at (1) speeding up computations (our GPU code can be 100 times faster than an efficient serial code written in C++), (2) obtaining solutions explaining more variance, and (3) dealing with big data problems (our cluster code can solve a 357 GB problem in a minute).

Keywords: Sparse PCA · Alternating maximization · GPower · Big data analytics · Unsupervised learning
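For concreteness, two members of this 2 × 2 × 2 family of single-unit formulations, written here purely as an illustration (the sparsity level s and penalty weight γ are our own symbols, not notation taken from the paper), are

max{‖Ax‖₂ ∶ ‖x‖₂ ≤ 1, ‖x‖₀ ≤ s}        (L2 variance, L0 constraint)

max{‖Ax‖₂ − γ‖x‖₁ ∶ ‖x‖₂ ≤ 1}          (L2 variance, L1 penalty)

The remaining formulations are obtained by measuring variance with the L1 norm instead of the L2 norm and by exchanging the roles of constraint and penalty for the sparsity-inducing norm.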

1 Introduction

Principal component analysis (PCA) is an indispensable tool used for dimension reduction in virtually all areas of science and engineering, from machine learning, statistics, genetics and finance to computer networks (Jolliffe 1986). Let A ∈ 𝐑^{n×p} denote a data matrix encoding n samples (observations) of p variables (features). PCA aims to extract a few linear combinations of the columns of A, called principal components (PCs), pointing in mutually orthogonal directions, which together explain as much variance in the data as possible. If the columns of A are centered, the problem of extracting the first PC can be written as

max{‖Ax‖ ∶ ‖x‖₂ ≤ 1},                                    (1.1)

where ‖ ⋅ ‖ is a suitable norm for measuring variance. The solution x of this optimization problem is called the loading vector, and Ax (normalized) is the first PC. Further PCs can be obtained in the same way, with A replaced by a new matrix in a process called deflation (Mackey 2008). Classical PCA employs the L2 norm in the objective; using the L1 norm instead may alleviate problems caused by outliers in the data and hence leads to a robust
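To make (1.1) and the deflation step concrete, the short Python sketch below (our own illustration, unrelated to the multi-core, GPU and cluster codes provided with the paper; the function names are hypothetical) computes the first loading vector for the classical L2 norm via the SVD of A, and then deflates A before extracting a second component.

import numpy as np

def first_loading(A):
    # Solve max{ ||Ax||_2 : ||x||_2 <= 1 } for the L2 norm: the maximizer
    # is the top right singular vector of A.
    _, _, Vt = np.linalg.svd(A, full_matrices=False)
    return Vt[0]                        # loading vector x with ||x||_2 = 1

def deflate(A, x):
    # Projection deflation: remove the component Ax from the column space of A,
    # so the next call to first_loading recovers the next principal component.
    p = A @ x
    p = p / np.linalg.norm(p)
    return A - np.outer(p, p @ A)

rng = np.random.default_rng(0)
A = rng.standard_normal((100, 10))
A -= A.mean(axis=0)                     # center the columns, as assumed in (1.1)

x1 = first_loading(A)
print("variance term ||A x1||_2 =", np.linalg.norm(A @ x1))
A2 = deflate(A, x1)
x2 = first_loading(A2)
print("variance term ||A2 x2||_2 =", np.linalg.norm(A2 @ x2))

A sparse loading vector would additionally restrict the number (or the L1 norm) of the nonzero entries of x, which is exactly what the 8 SPCA formulations studied in this paper control.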