Bias-corrected support vector machine with Gaussian kernel in high-dimension, low-sample-size settings



Yugo Nakayama¹ · Kazuyoshi Yata² · Makoto Aoshima²

Received: 28 September 2018 / Revised: 15 April 2019
© The Institute of Statistical Mathematics, Tokyo 2019

Abstract

In this paper, we study asymptotic properties of nonlinear support vector machines (SVM) in high-dimension, low-sample-size (HDLSS) settings. We propose a bias-corrected SVM (BC-SVM) that is robust against imbalanced data in a general framework. In particular, we investigate asymptotic properties of the BC-SVM with the Gaussian kernel and compare them with those of the BC-SVM with the linear kernel. We show that the performance of the BC-SVM is influenced by the scale parameter involved in the Gaussian kernel. We discuss a choice of the scale parameter that yields high performance and examine the validity of the choice by numerical simulations and real data analyses.

Keywords Geometric representation · HDLSS · Imbalanced data · Radial basis function kernel
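For reference, the Gaussian (radial basis function) kernel takes the form below, where $\gamma > 0$ is the scale parameter whose choice the paper investigates. This is one common parameterization; the paper's exact normalization (e.g., $\gamma$ versus $\gamma^2$ in the denominator) may differ:

\[
  k(x, y) = \exp\!\left( - \frac{\lVert x - y \rVert^2}{\gamma} \right), \qquad \gamma > 0.
\]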

We are very grateful to the associate editor and the reviewer for their constructive comments. The research of the second author was partially supported by Grant-in-Aid for Scientific Research (C), Japan Society for the Promotion of Science (JSPS), under Contract Number 18K03409. The research of the third author was partially supported by Grants-in-Aid for Scientific Research (A) and Challenging Research (Exploratory), JSPS, under Contract Numbers 15H01678 and 17K19956.

Corresponding author (✉): Makoto Aoshima, [email protected]
Yugo Nakayama, [email protected]
Kazuyoshi Yata, [email protected]

1  Graduate School of Pure and Applied Sciences, University of Tsukuba, 1-1-1 Tennodai, Tsukuba, Ibaraki 305-8571, Japan
2  Institute of Mathematics, University of Tsukuba, 1-1-1 Tennodai, Tsukuba, Ibaraki 305-8571, Japan


1 Introduction

A common feature of high-dimensional data is that, while the data dimension is high, the sample size is relatively small. We call such data "HDLSS" data. The current work handles the classification problem in the HDLSS framework. Suppose we have two independent populations, $\Pi_i$, $i = 1, 2$, each having a $d$-variate distribution with unknown mean vector $\mu_i$ and unknown covariance matrix $\Sigma_i$. We do not specify any distribution function for the $\Pi_i$. We have independent and identically distributed (i.i.d.) observations, $x_{i1}, \ldots, x_{i n_i}$, from each $\Pi_i$, and we assume $n_i \geq 2$. Let $x_0$ be an observation vector of an individual belonging to one of the $\Pi_i$, where $x_0$ and the $x_{ij}$ are assumed independent. Let $N = n_1 + n_2$. We consider the HDLSS context in which $d \to \infty$ while $N$ is fixed, or $N/d \to 0$ as $d, N \to \infty$.

In the HDLSS context, Hall et al. (2005), Marron et al. (2007) and Qiao et al. (2010) considered distance-weighted classifiers. Hall et al. (2008), Chan and Hall (2009) and Aoshima and Yata (2014) considered distance-based classifiers. Aoshima and Yata (2019) considered a distance-based classifier based on a data transformation technique. Aoshima and Yata (2011, 2015) consi
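As a concrete illustration of the set-up above, the following minimal sketch generates two $d$-variate samples with $d$ far larger than $N$ and fits a standard soft-margin SVM with a Gaussian kernel via scikit-learn. This is the plain SVM, not the paper's bias-corrected BC-SVM; the mean shift, the sample sizes, and the $d$-proportional scale are all illustrative assumptions.

# Minimal sketch of the HDLSS two-class setting: d >> N.
# Fits a *standard* soft-margin SVM with a Gaussian (RBF) kernel via
# scikit-learn; it is NOT the paper's bias-corrected SVM (BC-SVM).
# All parameter values below are illustrative assumptions.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
d, n1, n2 = 2000, 10, 10           # dimension d far exceeds N = n1 + n2

# Two populations Pi_1, Pi_2 with different mean vectors mu_1, mu_2.
mu1 = np.zeros(d)
mu2 = np.full(d, 0.15)             # small per-coordinate mean shift
X = np.vstack([rng.normal(mu1, 1.0, size=(n1, d)),
               rng.normal(mu2, 1.0, size=(n2, d))])
y = np.array([1] * n1 + [2] * n2)

# Gaussian kernel exp(-||x - y||^2 / scale); scikit-learn's `gamma`
# multiplies -||x - y||^2 directly, so gamma = 1 / scale.
scale = d                          # a d-proportional scale, for illustration only
clf = SVC(kernel="rbf", gamma=1.0 / scale, C=1.0).fit(X, y)

# Classify a new observation x0 drawn from Pi_1.
x0 = rng.normal(mu1, 1.0, size=(1, d))
print("predicted class:", clf.predict(x0)[0])

How the scale should actually grow with $d$ in order to keep the kernel informative is precisely the question the paper addresses; the fixed choice above is only a placeholder.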