The exact equivalence of distance and kernel methods in hypothesis testing


Cencheng Shen1 · Joshua T. Vogelstein2,3

Received: 12 February 2020 / Accepted: 17 September 2020
© Springer-Verlag GmbH Germany, part of Springer Nature 2020

Abstract Distance correlation and the Hilbert-Schmidt independence criterion are widely used for independence testing, two-sample testing, and many other inference tasks in statistics and machine learning. The two methods are closely related, yet most of the existing literature treats them as separate entities. In this paper, we propose a simple and elegant bijection between metric and kernel. The bijective transformation better preserves the similarity structure, makes distance correlation and the Hilbert-Schmidt independence criterion always the same for hypothesis testing, streamlines the code base for implementation, and enables the rich literatures of distance-based and kernel-based methodologies to communicate directly with each other.

Keywords Distance covariance · Hilbert-Schmidt independence criterion · Strong negative-type metric · Characteristic kernel

* Cencheng Shen [email protected]
  Joshua T. Vogelstein [email protected]

1 Department of Applied Economics and Statistics, University of Delaware, Newark, USA
2 Institute for Computational Medicine, Johns Hopkins University, Baltimore, USA
3 Department of Biomedical Engineering and Institute of Computational Medicine, Johns Hopkins University, Baltimore, USA

1 Introduction

Distance correlation is a distance-based method initially proposed for testing independence (Szekely et al. 2007; Szekely and Rizzo 2009). It has since been used for two-sample testing (Rizzo and Szekely 2016; Panda et al. 2020), conditional independence (Szekely and Rizzo 2014; Wang et al. 2015), feature screening (Li et al. 2012; Zhong and Zhu 2015; Wang et al. 2019), clustering (Szekely and Rizzo 2005; Rizzo and Szekely 2010), time-series testing (Zhou 2012; Fokianos and Pitsillou 2018; Mehta et al. 2020), and graph dependence (Lee et al. 2019; Xiong et al. 2020). The Hilbert-Schmidt independence criterion is a kernel-based method for testing independence and is equally popular in related inference tasks (Gretton et al. 2005; Fukumizu et al. 2007; Song et al. 2007; Gretton and Gyorfi 2010; Balasubramanian et al. 2013; Chang et al. 2013; Zhang et al. 2018). Both foundational methods are universally consistent for testing independence, which has motivated many other consistent methods with improved finite-sample power (Heller et al. 2013, 2016; Vogelstein et al. 2019; Shen et al. 2020; Zhu et al. 2017; Pan et al. 2018; Kim et al. 2018; Shen 2020).

The distance-based and kernel-based methods share similar formulations and many common properties. The independence hypothesis is formulated as follows: given paired sample data $\{(x_i, y_i) \in \mathbb{R}^{p+q},\ i = 1, \ldots, n\}$, where $p$ denotes the dimension of $x_i$ and $q$ denotes the dimension of $y_i$, we aim to test

$$H_0 : F_{XY} = F_X F_Y, \qquad H_A : F_{XY} \neq F_X F_Y$$

by assuming iid samples {(xi , yi