Estimating total species using a weighted combination of expected mixture distribution component counts

Konstantin Shestopaloff¹ · Wei Xu¹,² · Michael D. Escobar¹

Received: 4 October 2019 / Revised: 3 May 2020
© Springer Science+Business Media, LLC, part of Springer Nature 2020

Abstract

In this paper we present a weighted mixture distribution component counts (MDCC) approach for estimating the total number of species. The proposed method combines conditional estimates of component counts from several candidate mixture distributions and uses the bootstrap for confidence interval estimation. The distribution specification is flexible and can be adjusted to suit a variety of datasets. Smoothing techniques can also be incorporated to improve modeling of sparse data. The method is tested in a simulation study and applied to two microbiome datasets for illustration. Simulation results indicate improved bias, mean squared error and confidence interval coverage relative to comparison methods, as well as robustness to the underlying data structure.

Keywords  Mixture distribution · Statistical ecology · Total species · Unobserved species · Weighted estimator

Environmental and Ecological Statistics

Handling Editor: Pierre Dutilleul.

* Konstantin Shestopaloff [email protected]

¹ Dalla Lana School of Public Health, University of Toronto, 6th Floor, Health Sciences Building, 155 College Street, Toronto, ON M5T 3M7, Canada

² Princess Margaret Cancer Centre, University Health Network, Toronto, Canada

1 Introduction

The problem of total species estimation has existed for many decades, from the early stochastic models of ecological diversity (Yule 1925) to the subsequent formulation of the commonly used stochastic abundance model (Fisher et al. 1943) and the succeeding years, which have produced a steady evolution of new and improved estimators. Whether motivated by similar data outside ecology (Good 1953; Good and Toulmin 1956), the advent of new technologies or the emergence of new data, the problem remains relevant to this day. The inherent challenge of estimating an unobserved quantity in sparse and skewed data still leaves room for further improvement and motivates continued research. The variety of existing estimators confirms this persistent interest. Some leverage the inherent structure of ecological communities, while others focus on classical statistical approaches, all of which has resulted in an array of parametric, nonparametric, frequentist and Bayesian estimators.

In this paper we present a weighted mixture distribution component counts (MDCC) estimator that combines conditional estimates of expected mixture component samples from several candidate mixture distributions. The approach aims to be widely applicable to modern high-throughput data, particularly microbiome data, and is formulated to be robust to different data structures and estimator convergence. The method is made flexible through the specification of candidate mixtures and can also incorporate