Variable Selection for Mixed Data Clustering: Application in Human Population Genomics

Matthieu Marbac¹ · Mohammed Sedki² · Étienne Patin³

© The Classification Society 2019

Abstract

Model-based clustering of human population genomic data, composed of 1,318 individuals from western Central Africa and 160,470 markers, is considered. This challenging analysis leads us to develop a new methodology for variable selection in clustering. To explain the differences between subpopulations and to increase the accuracy of the estimates, variable selection is performed simultaneously with clustering. We propose two approaches for selecting variables when clustering is performed with the latent class model (i.e., a mixture model assuming independence within components). The first method simultaneously performs model selection and parameter inference. It optimizes the Bayesian Information Criterion with a modified version of the standard expectation–maximization algorithm. The second method performs model selection without requiring parameter inference by maximizing the Maximum Integrated Complete-data Likelihood criterion. Although the application considers categorical data, the proposed methods are introduced in the general context of mixed data (data composed of different types of features). As a first step, the relevance of both proposed methods is shown on simulated data and on several real benchmark datasets. Then, we apply the clustering method to the human population genomic data, which allows us to detect the most discriminative genetic markers. The proposed methods are implemented in the R package VarSelLCM, available on CRAN.

Keywords Human evolutionary genetics · Information criterion · Mixed data · Model-based clustering · Variable selection
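To make the BIC-based idea concrete, the following minimal Python sketch evaluates the log-likelihood of a latent class model on categorical data and the corresponding penalized criterion for a given set of relevant variables. It is an illustration only, not the VarSelLCM implementation: the function names, the convention that an irrelevant variable reuses the first component's parameters in every component, and the sign convention (log-likelihood minus half the number of free parameters times log n, so that larger is better) are all assumptions made here.

```python
import math


def lcm_log_likelihood(X, pi, theta, relevant):
    """Observed-data log-likelihood of a latent class model (categorical data).

    X        : list of rows, each a list of integer levels (0-based)
    pi       : mixing proportions, one per component
    theta    : theta[k][j][h] = probability of level h for variable j in component k
    relevant : relevant[j] is True if variable j is discriminative
               (assumption: irrelevant variables use theta[0][j] in every component)
    """
    total = 0.0
    for row in X:
        log_terms = []
        for k, pi_k in enumerate(pi):
            lp = math.log(pi_k)
            for j, level in enumerate(row):
                comp = k if relevant[j] else 0
                lp += math.log(theta[comp][j][level])
            log_terms.append(lp)
        # Log-sum-exp over components for numerical stability.
        mx = max(log_terms)
        total += mx + math.log(sum(math.exp(t - mx) for t in log_terms))
    return total


def n_free_params(g, m, relevant):
    """Free parameters for g components and variables with m levels each."""
    n_rel = sum(1 for r in relevant if r)
    n_irr = len(relevant) - n_rel
    # (g - 1) proportions, (m - 1) probabilities per distribution;
    # a relevant variable has g distributions, an irrelevant one has 1.
    return (g - 1) + (m - 1) * (g * n_rel + n_irr)


def bic(loglik, nu, n):
    """Penalized log-likelihood (assumed convention: larger is better)."""
    return loglik - 0.5 * nu * math.log(n)
```

Under these assumptions, declaring a variable irrelevant removes (g − 1)(m − 1) parameters from the penalty, so the criterion favors dropping a variable whenever its component-specific distributions do not raise the likelihood enough to offset that cost.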

Matthieu Marbac: [email protected]

¹ CREST, Ensai, Bruz, France
² UMR Inserm-1181, University of Paris-Sud, Orsay, France
³ CNRS URA3012, Institut Pasteur, Paris, France

Journal of Classification

1 Introduction

Clustering (Kettenring 2006) summarizes large datasets by grouping observations into a few homogeneous classes. It is regularly used in several emerging branches of science, such as functional, ecological, and population genomics (Lawson and Falush 2012; Ronan et al. 2016). This paper focuses on clustering a dataset composed of 160,470 markers (categorical variables with three levels) and 1,318 individuals from 35 human populations of western Central Africa (Patin et al. 2017). The analysis relies on a finite mixture model (McLachlan and Peel 2000; McNicholas 2016a, b), which formulates the unknown partition of the observations in a probabilistic framework. Because the partition may be explained by only a subset of the variables, variable selection in clustering (Biernacki and Maugis-Rabusseau 2015; Fop et al. 2017) is considered. This selection improves the accuracy of model fitting, which is especially important given the dimension of the data. Moreover, it highlights the subset of markers that explains the differences between subpopulations. This paper focuses on var