Large-Scale Simultaneous Testing Using Kernel Density Estimation

Santu Ghosh, Augusta University, Augusta, USA

Alan M. Polansky, Northern Illinois University, DeKalb, USA

Abstract. A century ago, when Student's t-statistic was introduced, no one imagined its breadth of application in the modern era. It now appears in highly multiple hypothesis testing, feature selection and ranking, high-dimensional signal detection, and related problems. Student's t-statistic is constructed from the empirical distribution function (EDF). An alternative to the EDF is the kernel density estimate (KDE), a smoothed version of the EDF. The novelty of this work is twofold: an alternative to Student's t-test based on the KDE, and an exploration of the usefulness of the KDE-based t-test when applied to large-scale simultaneous hypothesis testing. An optimal bandwidth parameter for the KDE approach is derived by minimizing the asymptotic error between the true p-value and its estimate based on the normal approximation. When the KDE-based approach is used for large-scale simultaneous testing, a natural question is when the method fails to control the error rate. We show that the proposed KDE-based method controls the false discovery rate (FDR) provided the total number of tests diverges at a smaller order of magnitude than N^{3/2}, where N is the total sample size. We compare our method to several alternatives with respect to FDR and show in simulations that it produces a lower proportion of false discoveries than its competitors, that is, it controls the false discovery rate more effectively. These empirical studies demonstrate that the proposed method can be applied successfully in practice, and its usefulness is further illustrated through a gene expression data example.

AMS (2000) subject classification. Primary 62F03; Secondary 62G10.

Keywords and phrases. Two-sample t-test, Kernel density estimator, Edgeworth expansion, False discovery rate
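The abstract contrasts the EDF with its kernel-smoothed counterpart. As a minimal illustration of that contrast (not the authors' method), the sketch below compares the EDF with a Gaussian-kernel smoothed distribution function estimate; the function names and the Silverman rule-of-thumb bandwidth are illustrative choices, not taken from the paper.

```python
import numpy as np
from math import erf, sqrt

def edf(t, data):
    """Empirical distribution function at t: fraction of observations <= t."""
    return float(np.mean(np.asarray(data) <= t))

def smoothed_edf(t, data, h):
    """Gaussian-kernel smoothed EDF: average of Phi((t - X_i) / h),
    where Phi is the standard normal CDF. As h -> 0 this recovers the EDF."""
    return float(np.mean([0.5 * (1.0 + erf((t - xi) / (h * sqrt(2.0))))
                          for xi in data]))

rng = np.random.default_rng(0)
x = rng.normal(size=200)
# Silverman's rule-of-thumb bandwidth (an illustrative default, not the
# paper's optimal bandwidth, which is derived from p-value accuracy).
h = 1.06 * x.std(ddof=1) * x.size ** (-0.2)
print(edf(0.0, x), smoothed_edf(0.0, x, h))
```

The paper's contribution is a different bandwidth choice: one minimizing the asymptotic error between the true p-value and its normal-approximation estimate, rather than a density-estimation criterion like Silverman's rule.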


1 Introduction

Modern high-throughput technologies generate data in abundance. Microarray technology is a classic example, in which each subject is measured on thousands or tens of thousands of standard features. For such data sets, we are often interested in performing large-scale simultaneous significance testing, where a common objective is to find features whose mean parameters differ from the others. For example, genomic technologies, including microarrays and RNA-seq, allow the expression levels of thousands of genes to be monitored simultaneously in cancerous and normal organ tissues. Lists of genes differentially expressed between malignant and non-malignant states can be fertile sources of cancer biomarkers. In this example, the biological question of differential expression can be treated as a two-sample multiple hypothesis testing problem for the mean parameters. To fix the conventional framework
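The two-sample multiple testing setup described above can be made concrete with a standard baseline (the classical Student's t-test with Benjamini-Hochberg FDR control, i.e., one of the competitors the paper improves upon, not the KDE-based method itself). The sketch below simulates a microarray-like matrix, computes a pooled two-sample t-statistic per feature, and applies the BH step-up procedure; for simplicity it uses a normal approximation to the null distribution of t, which is precisely the kind of approximation whose accuracy the paper studies. All sizes and names are illustrative.

```python
import numpy as np
from math import erf, sqrt

def two_sample_t(x, y):
    """Pooled-variance two-sample t-statistic, computed row-wise."""
    n1, n2 = x.shape[1], y.shape[1]
    s2 = ((n1 - 1) * x.var(axis=1, ddof=1)
          + (n2 - 1) * y.var(axis=1, ddof=1)) / (n1 + n2 - 2)
    return (x.mean(axis=1) - y.mean(axis=1)) / np.sqrt(s2 * (1 / n1 + 1 / n2))

def bh_reject(p, q=0.05):
    """Benjamini-Hochberg step-up at level q; returns indices rejected."""
    m = len(p)
    order = np.argsort(p)
    below = p[order] <= q * np.arange(1, m + 1) / m
    k = (below.nonzero()[0].max() + 1) if below.any() else 0
    return order[:k]

rng = np.random.default_rng(1)
m, n1, n2 = 1000, 20, 20                 # features (genes) and group sizes
X = rng.normal(size=(m, n1))             # "normal" tissue samples
Y = rng.normal(size=(m, n2))             # "tumour" tissue samples
Y[:50] += 2.0                            # first 50 features truly shifted

t = two_sample_t(X, Y)
# Two-sided p-values via a normal approximation to the null distribution.
p = 2.0 * np.array([1.0 - 0.5 * (1.0 + erf(abs(ti) / sqrt(2.0))) for ti in t])
hits = bh_reject(p, q=0.05)
print(len(hits), "rejections;", int(np.sum(np.asarray(hits) < 50)), "true")
```

With small per-feature sample sizes, the normal (or t) approximation to the null p-value can be inaccurate in the extreme tails that matter for multiple testing; the paper's KDE-based construction targets exactly this deficiency.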