So you think you can PLS-DA?


RESEARCH

Open Access

Daniel Ruiz-Perez1, Haibin Guan1, Purnima Madhivanan2, Kalai Mathee3 and Giri Narasimhan1*

From 8th IEEE International Conference on Computational Advances in Bio and medical Sciences (ICCABS 2018), Las Vegas, NV, USA, 18-20 October 2018

Abstract

Background: Partial Least-Squares Discriminant Analysis (PLS-DA) is a popular machine learning tool that is gaining increasing attention as a useful feature selector and classifier. In an effort to understand its strengths and weaknesses, we performed a series of experiments with synthetic data and compared its performance with that of the close relative from which it was originally derived, Principal Component Analysis (PCA).

Results: We demonstrate that even though PCA ignores the class labels of the samples, this unsupervised tool can be remarkably effective as a feature selector. In some cases, it outperforms PLS-DA, which is given the class labels as part of its input. Our experiments range from examining the signal-to-noise ratio in the feature selection task to considering many practical distributions and models encountered when analyzing bioinformatics and clinical data. Other methods were also evaluated. Finally, we analyzed an interesting data set of 396 vaginal microbiome samples for which the ground truth for the feature selection was available. All the 3D figures shown in this paper, as well as the supplementary ones, can be viewed interactively at http://biorg.cs.fiu.edu/plsda

Conclusions: Our results highlight the strengths and weaknesses of PLS-DA in comparison with PCA for different underlying data models.

Keywords: PLS-DA, PCA, Feature selection, Dimensionality reduction, Bioinformatics

*Correspondence: [email protected]
1 Bioinformatics Research Group (BioRG), Florida International University, 11200 SW 8th St, 33199, Miami, FL USA
Full list of author information is available at the end of the article

Background

Partial Least-Squares Discriminant Analysis (PLS-DA) is a multivariate dimensionality-reduction tool [1, 2] that has been popular in the field of chemometrics for well over two decades [3], and has been recommended for use in omics data analyses. PLS-DA is gaining popularity in metabolomics and in other integrative omics analyses [4–6]. Both chemometrics and omics data sets are characterized by large volume, a large number of features, noise and missing data [2, 7]. These data sets also often have far fewer samples than features. PLS-DA can be thought of as a “supervised” version of Principal Component Analysis (PCA) in the sense that it achieves dimensionality reduction but with full awareness of the class labels. Besides its use for dimensionality reduction, it can be adapted for feature selection [8] as well as for classification [9–11]. As its popularity grows, it is important to note that its role in discriminant analysis can be easily misused and misinterpreted [2, 12]. Since it is prone to overfitting, cross-validation (CV) is an important step in using P
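The contrast between the two methods can be made concrete with a small sketch. It is not the authors' experimental setup: the synthetic data model, the function names (make_synthetic, pca_first_direction, pls_first_direction, top_features, pls_cv_accuracy) and all parameter values are assumptions chosen for illustration. The sketch uses only the first component of each method, without feature scaling. PCA picks the direction of maximum variance and never sees the labels, whereas the first PLS direction maximizes covariance with the labels. Features are then ranked by absolute weight, and a simple k-fold cross-validation estimates the accuracy of a one-component PLS-DA classifier.

```python
import numpy as np


def make_synthetic(n_samples=200, n_features=50, n_signal=5, shift=3.0, seed=0):
    """Two-class Gaussian data; only the first n_signal features differ by class.

    This data model is an illustrative assumption, not the paper's.
    """
    rng = np.random.default_rng(seed)
    y = np.repeat([0.0, 1.0], n_samples // 2)
    X = rng.normal(size=(n_samples, n_features))
    X[y == 1, :n_signal] += shift
    return X, y


def pca_first_direction(X):
    """Unit vector w maximizing the variance of Xc @ w (labels are ignored)."""
    Xc = X - X.mean(axis=0)
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    return vt[0]


def pls_first_direction(X, y):
    """Unit vector w maximizing the covariance between Xc @ w and the labels."""
    Xc = X - X.mean(axis=0)
    w = Xc.T @ (y - y.mean())
    return w / np.linalg.norm(w)


def top_features(w, k):
    """Indices of the k features with the largest absolute weight."""
    return set(np.argsort(-np.abs(w))[:k].tolist())


def pls_cv_accuracy(X, y, n_folds=5, seed=0):
    """k-fold accuracy of a one-component PLS-DA classifier (threshold 0.5)."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    correct = 0
    for test in folds:
        train = np.setdiff1d(np.arange(len(y)), test)
        x_mean, y_mean = X[train].mean(axis=0), y[train].mean()
        w = pls_first_direction(X[train], y[train])
        t = (X[train] - x_mean) @ w            # training scores
        b = t @ (y[train] - y_mean) / (t @ t)  # regress labels on scores
        y_hat = y_mean + ((X[test] - x_mean) @ w) * b
        correct += np.sum((y_hat > 0.5) == (y[test] == 1))
    return correct / len(y)


X, y = make_synthetic()
pca_top = top_features(pca_first_direction(X), k=5)
pls_top = top_features(pls_first_direction(X, y), k=5)
accuracy = pls_cv_accuracy(X, y)
print("PCA top features:   ", sorted(pca_top))
print("PLS-DA top features:", sorted(pls_top))
print("PLS-DA 5-fold CV accuracy:", accuracy)
```

In this easy setting the class difference also dominates the variance, so the unsupervised PCA direction should rank the same features highly as the supervised one. Lowering shift or adding high-variance noise features (both hypothetical knobs of this sketch) is one way to explore when the two rankings diverge.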
