Partial Least Squares for Heterogeneous Data

Peter Bühlmann
Seminar for Statistics, ETH Zurich, Zürich, Switzerland
e-mail: [email protected]

Abstract Large-scale data, where the sample size and the dimension are high, often exhibits heterogeneity. This can arise, for example, in the form of unknown subgroups or clusters, batch effects, or contaminated samples. Ignoring these issues would often lead to poor prediction and estimation. We advocate the maximin effects framework (Meinshausen and Bühlmann, Maximin effects in inhomogeneous large-scale data. Preprint arXiv:1406.0596, 2014) to address the problem of heterogeneous data. In combination with partial least squares (PLS) regression, we obtain a new PLS procedure which is robust and tailored for large-scale heterogeneous data. A small empirical study complements our exposition of new PLS methodology.

Keywords: Partial least squares regression (PLSR) • Heterogeneous data • Big data • Minimax • Maximin

1.1 Introduction

Large-scale complex data, where the total sample size n and the number of variables p (i.e., the “dimension”) are large, arise in many areas of science. In high-dimensional settings, regularized estimation schemes have become popular and are well established (cf. Hastie et al. 2009; Bühlmann and van de Geer 2011). Partial least squares (PLS) (Wold 1966) is an interesting procedure that is widely used in many applications: besides its good prediction performance, its “vague similarity” to Ridge regression (Frank and Friedman 1993), and its usefulness for dimensionality reduction, it is computationally attractive for large-scale problems because it operates in an iterative fashion based on empirical covariances only (Geladi and Kowalski 1986; Esposito Vinzi et al. 2010), as sketched below.
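To make the covariance-based iteration concrete, the following is a minimal sketch of a PLS1/NIPALS-style recursion for a univariate response. It is an illustration under our own simplifications (the function name pls1 and the centering and deflation choices are ours), not the exact algorithm of the references above; the point is that each component is built from nothing more than the empirical covariance X^T y of the current residuals.

```python
import numpy as np

def pls1(X, y, K):
    """Minimal PLS1 (NIPALS-style) sketch with K components.

    Each component needs only the empirical covariance X^T y of the
    current residuals, which is what keeps PLS cheap for large n and p.
    """
    Xc = X - X.mean(axis=0)           # center predictors
    yc = y - y.mean()                 # center response
    W, P, q = [], [], []
    for _ in range(K):
        w = Xc.T @ yc                 # direction from empirical covariance
        w /= np.linalg.norm(w)
        t = Xc @ w                    # latent score
        tt = t @ t
        p = Xc.T @ t / tt             # X loadings
        c = (yc @ t) / tt             # y loading
        W.append(w); P.append(p); q.append(c)
        Xc = Xc - np.outer(t, p)      # deflate X
        yc = yc - c * t               # deflate y
    W, P = np.column_stack(W), np.column_stack(P)
    # regression coefficients on the centered X scale
    return W @ np.linalg.solve(P.T @ W, np.array(q))
```

A prediction for new rows Xnew is then y.mean() + (Xnew - X.mean(axis=0)) @ beta, with beta the returned coefficient vector.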

When the total sample size n is large, as in “big data” problems, we typically expect that the observations are heterogeneous and not i.i.d. or stationary realizations from a single probability distribution. Ignoring such heterogeneity (e.g., unknown subpopulations, batch and clustering effects, or outliers) is likely to produce poor prediction and estimation. Classical approaches to address these issues include robust methods (Huber 2011), varying coefficient models (Hastie and Tibshirani 1993), mixed effects models (Pinheiro and Bates 2000), and mixture models (McLachlan and Peel 2004). Mostly for computational reasons with large-scale data, we aim for methods which are computationally efficient, with a structure that allows for simple parallel processing. This can be achieved with the so-called maximin effects approach (Meinshausen and Bühlmann 2015) and its corresponding subsampling-and-aggregation “magging” procedure (Bühlmann and Meinshausen 2016). As we will discuss, the computational efficiency of partial least squares
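To fix ideas about the aggregation just mentioned: in magging, an estimator is fitted on each group (or random subsample) of the data, which is trivially parallelizable, and the group estimates are then combined by the convex weights whose aggregated fitted values have minimal squared norm. The sketch below is our own minimal illustration, assuming known group labels and using per-group least squares in place of a per-group PLS fit; the fit argument could equally be the pls1 sketch above.

```python
import numpy as np
from scipy.optimize import minimize

def magging(X, y, groups,
            fit=lambda X, y: np.linalg.lstsq(X, y, rcond=None)[0]):
    """Sketch of maximin aggregation ("magging"):
    1) fit an estimator on each group (trivially parallelizable),
    2) aggregate with the convex weights minimizing the squared norm
       of the aggregated fitted values ||F w||^2."""
    labels = np.unique(groups)
    # one coefficient vector per group, computed independently
    B = np.column_stack([fit(X[groups == g], y[groups == g]) for g in labels])
    F = X @ B                         # fitted values of each group estimate
    H = F.T @ F
    G = len(labels)
    # small quadratic program over the simplex {w >= 0, sum(w) = 1}
    res = minimize(lambda w: w @ H @ w, np.full(G, 1.0 / G),
                   jac=lambda w: 2 * H @ w,
                   bounds=[(0.0, 1.0)] * G,
                   constraints=[{"type": "eq", "fun": lambda w: w.sum() - 1.0}],
                   method="SLSQP")
    return B @ res.x                  # magging coefficient vector
```

The weight computation is a small quadratic program over the simplex; solving it with SciPy's SLSQP routine, as here, is one convenient choice, and the per-group fits can be distributed across machines without any communication until the final aggregation step.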