An empirical comparison of two approaches for CDPCA in high-dimensional data


Adelaide Freitas · Eloísa Macedo · Maurizio Vichi

Accepted: 7 August 2020
© Springer-Verlag GmbH Germany, part of Springer Nature 2020

Abstract Modified principal component analysis techniques, especially those yielding sparse solutions, are attractive because they ease interpretation, particularly for high-dimensional data sets. Clustering and disjoint principal component analysis (CDPCA) is a constrained PCA that promotes sparsity in the loadings matrix. Specifically, CDPCA seeks to describe the data in terms of disjoint (and possibly sparse) components while simultaneously identifying clusters of objects. Based on simulated and real gene expression data sets in which the number of variables exceeds the number of objects, we empirically compare the performance of two heuristic iterative procedures proposed in the specialized literature to perform CDPCA, namely the ALS and two-step-SDP algorithms. To avoid possible effects of differing variances among the original variables, all data were standardized. Although both procedures perform well, numerical tests highlight two main features that distinguish the two-step-SDP algorithm: it is faster than ALS and, since it applies a clustering procedure (k-means) to the variables, it outperforms ALS in recovering the true variable partitioning underlying the generated data sets. Overall, both procedures produce satisfactory results in terms of solution precision, where ALS performs better, and in recovering the true object clusters, where two-step-SDP outperforms ALS for data sets with smaller sample size and greater structural complexity (i.e., higher error level in the CDPCA model). The proportion of variance explained by the components estimated by both algorithms is affected by the complexity of the data structure (the higher the error level, the lower the explained variance) and is similar for the two algorithms, except for data sets with two object clusters, where the two-step-SDP approach yields higher explained variance. Moreover, experimental tests suggest that the two-step-SDP approach is, in general, better able to recover the true number of object clusters, while the ALS algorithm yields better-quality object clustering, with more homogeneous, compact and well-separated clusters in the reduced space of the CDPCA components.

Electronic supplementary material The online version of this article (https://doi.org/10.1007/s10260-020-00546-2) contains supplementary material, which is available to authorized users.

Extended author information available on the last page of the article


Keywords Principal component analysis · Clustering of objects · Partitioning of attributes · Semidefinite programming
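
As an illustration of the disjoint-components idea described in the abstract, the following Python sketch builds a loadings matrix in which every variable loads on exactly one component, given a partition of the variables that is assumed to be known. It is neither the ALS nor the two-step-SDP algorithm, and it omits the object-clustering part of CDPCA. The function names, the per-block leading-eigenvector construction, and the variance ratio used here are assumptions made for illustration only.

```python
import numpy as np


def standardize(X):
    """Center each column and scale it to unit sample variance."""
    return (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)


def disjoint_loadings(X, var_labels, n_components):
    """Return a (variables x components) loadings matrix with disjoint support.

    Assumption for this sketch: the variable partition `var_labels` is given.
    Within each block, the nonzero loadings are the leading eigenvector of
    the block's sample covariance matrix; all other entries stay zero.
    """
    var_labels = np.asarray(var_labels)
    A = np.zeros((X.shape[1], n_components))
    for q in range(n_components):
        idx = np.flatnonzero(var_labels == q)
        if idx.size == 0:
            raise ValueError(f"component {q} has no variables assigned")
        # atleast_2d handles blocks that contain a single variable
        cov = np.atleast_2d(np.cov(X[:, idx], rowvar=False))
        _, vecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
        A[idx, q] = vecs[:, -1]        # unit-norm leading eigenvector
    return A


def explained_variance_ratio(X, A):
    """Variance of the component scores divided by the total variance of X."""
    scores = X @ A
    return scores.var(axis=0, ddof=1).sum() / X.var(axis=0, ddof=1).sum()


# Hypothetical data with more variables (50) than objects (20)
rng = np.random.default_rng(0)
X = standardize(rng.normal(size=(20, 50)))
var_labels = np.arange(50) % 3  # arbitrary partition into 3 blocks
A = disjoint_loadings(X, var_labels, n_components=3)
print(explained_variance_ratio(X, A))
```

Because the columns of the loadings matrix have non-overlapping support and unit norm, they are orthonormal by construction, which is why the ratio above cannot exceed 1.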

1 Introduction

Ever-increasing problem size demands the development of novel techniques to perform statistical analysis
