A distance based multisample test for high-dimensional compositional data with applications to the human microbiome

METHODOLOGY

Open Access

Qingyang Zhang* and Thy Dao

From The 20th International Conference on Bioinformatics & Computational Biology (BIOCOMP 2019), Las Vegas, NV, USA, 29 July–01 August 2019

*Correspondence: [email protected]
Department of Mathematical Sciences, University of Arkansas, Fayetteville, AR 72701, USA

Abstract

Background: Compositional data refer to the data that lie on a simplex, which are common in many scientific domains such as genomics, geology and economics. As the components in a composition must sum to one, traditional tests based on unconstrained data become inappropriate, and new statistical methods are needed to analyze this special type of data.

Results: In this paper, we consider a general problem of testing for the compositional difference between K populations. Motivated by microbiome and metagenomics studies, where the data are often over-dispersed and high-dimensional, we formulate a well-posed hypothesis from a Bayesian point of view and suggest a nonparametric test based on inter-point distance to evaluate statistical significance. Unlike most existing tests for compositional data, our method does not rely on any data transformation, sparsity assumption or regularity conditions on the covariance matrix, but directly analyzes the compositions. Simulated data and two real data sets on the human microbiome are used to illustrate the promise of our method.

Conclusions: Our simulation studies and real data applications demonstrate that the proposed test is more sensitive to the compositional difference than the mean-based method, especially when the data are over-dispersed or zero-inflated. The proposed test is easy to implement and computationally efficient, facilitating its application to large-scale datasets.

Keywords: Microbiome, Compositional data, High dimensionality, Centered log-ratio transformation, Multisample test, Distance correlation

Background

Data that lie on the simplex $S^{d-1} = \{(x_1, x_2, \ldots, x_d) : \min_j x_j \ge 0,\ \sum_{j=1}^{d} x_j = 1\}$ are often called $(d-1)$-dimensional compositional data, and they arise in many scientific disciplines such as genomics, geology and economics [1–3]. As the components in a composition must sum to one, classic statistical tests including two-sample t-test and
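As a small illustration of the simplex definition above, the following Python sketch rescales a vector of nonnegative counts into a composition and checks the two constraints (nonnegativity and unit sum). The function names `closure` and `on_simplex`, the tolerance, and the example counts are assumptions made for illustration only; they are not taken from the paper.

```python
import math


def closure(counts):
    """Rescale nonnegative counts so that they sum to one (hypothetical helper)."""
    total = sum(counts)
    if total <= 0 or any(c < 0 for c in counts):
        raise ValueError("counts must be nonnegative with a positive total")
    return [c / total for c in counts]


def on_simplex(x, tol=1e-9):
    """Check min_j x_j >= 0 and sum_j x_j = 1, up to an assumed tolerance."""
    return min(x) >= 0 and math.isclose(sum(x), 1.0, abs_tol=tol)


# Made-up counts for four taxa; a zero entry is allowed on the simplex.
composition = closure([12, 0, 30, 8])
print(composition, on_simplex(composition))
```

Because every composition produced this way satisfies the unit-sum constraint, its components cannot vary independently, which is the reason tests designed for unconstrained data do not carry over directly.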

© The Author(s). 2020 Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory
