An empirical comparison of two approaches for CDPCA in high-dimensional data
Adelaide Freitas · Eloísa Macedo · Maurizio Vichi
Accepted: 7 August 2020 © Springer-Verlag GmbH Germany, part of Springer Nature 2020
Abstract Modified principal component analysis techniques, especially those yielding sparse solutions, are attractive because they aid interpretation, in particular for high-dimensional data sets. Clustering and disjoint principal component analysis (CDPCA) is a constrained PCA that promotes sparsity in the loadings matrix. In particular, CDPCA seeks to describe the data in terms of disjoint (and possibly sparse) components while simultaneously identifying clusters of objects. Based on simulated and real gene expression data sets in which the number of variables exceeds the number of objects, we empirically compare the performance of two heuristic iterative procedures proposed in the specialized literature to perform CDPCA, namely the ALS and two-step-SDP algorithms. To avoid possible effects of differing variances among the original variables, all data were standardized. Although both procedures perform well, numerical tests highlight two main features that distinguish their performance, both related to the two-step-SDP algorithm: it provides faster results than ALS and, since it employs a clustering procedure (k-means) on the variables, it outperforms the ALS algorithm in recovering the true variable partitioning underlying the generated data sets. Overall, both procedures produce satisfactory results in terms of solution precision, where ALS performs better, and in recovering the true object clusters, where two-step-SDP outperforms the ALS approach for data sets with smaller sample size and greater structural complexity (i.e., higher error level in the CDPCA model). The proportion of variance explained by the components estimated by both algorithms is affected by the complexity of the data structure (the higher the error level, the lower the explained variance) and takes similar values for the two algorithms, except for data sets with two object clusters, where the two-step-SDP approach yields higher explained variance.
Moreover, experimental tests suggest that the two-step-SDP approach is, in general, better able to recover the true number of object clusters, while the ALS algorithm is better in terms of object clustering quality, yielding more homogeneous, compact and well-separated clusters in the reduced space of the CDPCA components.
Electronic supplementary material The online version of this article (https://doi.org/10.1007/s10260-020-00546-2) contains supplementary material, which is available to authorized users.
Extended author information available on the last page of the article
Keywords Principal component analysis · Clustering of objects · Partitioning of attributes · Semidefinite programming
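To make the notion of disjoint components in the abstract concrete, the following is a minimal sketch, not the ALS or the two-step-SDP algorithm compared in the article. It assumes a simplified pipeline: standardize the data, partition the variables with a plain k-means (as the abstract says two-step-SDP does), and then take the first principal direction of each variable block so that every variable loads on exactly one component. The object-clustering step and the semidefinite programming step are omitted entirely. All function names, the simulated data and the parameter values are hypothetical.

```python
import numpy as np

def standardize(X):
    # Center each variable and scale it to unit variance.
    return (X - X.mean(axis=0)) / X.std(axis=0)

def kmeans(points, k, n_iter=50, seed=0):
    # Plain Lloyd's k-means; returns one cluster label per row of `points`.
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        dist = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        for c in range(k):
            if np.any(labels == c):
                centers[c] = points[labels == c].mean(axis=0)
    return labels

def disjoint_loadings(X, var_labels, q):
    # Build a J x q loadings matrix in which each variable has a nonzero
    # loading on at most one component (the one of its own block).
    A = np.zeros((X.shape[1], q))
    for c in range(q):
        idx = np.where(var_labels == c)[0]
        if idx.size == 0:
            continue  # an empty block leaves a zero column
        # First right singular vector = first principal direction of the block.
        _, _, vt = np.linalg.svd(X[:, idx], full_matrices=False)
        A[idx, c] = vt[0]
    return A

def explained_variance_ratio(X, A):
    # Variance of the component scores relative to the total variance.
    scores = X @ A
    return scores.var(axis=0).sum() / X.var(axis=0).sum()

# Simulated example (assumed setup): 20 objects, 60 variables (more variables
# than objects), three blocks of 20 variables each driven by one latent factor.
rng = np.random.default_rng(1)
n_objects, n_vars, q = 20, 60, 3
factors = rng.normal(size=(n_objects, q))
true_blocks = np.repeat(np.arange(q), n_vars // q)
X = factors[:, true_blocks] + 0.3 * rng.normal(size=(n_objects, n_vars))
X = standardize(X)

# Cluster the variables: each variable is a point in object space (a row of X.T).
var_labels = kmeans(X.T, q)
A = disjoint_loadings(X, var_labels, q)
ratio = explained_variance_ratio(X, A)
print("variables per component:", np.count_nonzero(A, axis=0))
print("proportion of explained variance:", round(float(ratio), 3))
```

Because the blocks do not overlap, the columns of the loadings matrix are mutually orthogonal by construction, which is what makes each component directly attributable to one group of variables.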
1 Introduction Ever-increasing problem size demands the development of novel techniques to perform statistical analysis