Influence Diagnostics in Support Vector Machines




RESEARCH ARTICLE

Influence Diagnostics in Support Vector Machines

Sunwha Kim¹ · Choongrak Kim²

Received: 8 May 2019 / Accepted: 23 October 2019
© Korean Statistical Society 2020

Abstract

Support vector machines (SVM) are a very efficient and popular tool for classification; however, their non-robustness to outliers is a critical drawback. In fact, SVM is more sensitive to outliers than other classifiers, since the optimal separating hyperplane obtained by SVM is determined solely by the support vectors. So far, studies of outliers in SVM have tried to minimize the effect of outliers by specifying robust loss functions. In this paper, we propose a version of Cook's distance for SVM based on the deletion method and the infinitesimal perturbation method, and we express this Cook's distance in terms of basic building blocks such as the residual and the leverage. Further, we propose a simple measure which can be used either as a descriptive statistic in SVM diagnostics or as an approximate measure when Cook's distance cannot be computed due to high dimensionality.

Keywords Cook's distance · Kernel function · Influence measure · Outlier
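To make the deletion idea concrete, the following is a minimal brute-force sketch of deletion-based influence for an SVM, assuming scikit-learn's SVC. The helper `deletion_influence` and its mean-squared-change summary are hypothetical illustrations of the idea, not the closed-form Cook's distance derived in the paper.

```python
# A minimal sketch of deletion-based influence for SVM (hypothetical
# helper; brute-force refitting, not the paper's closed-form formula).
import numpy as np
from sklearn.svm import SVC

def deletion_influence(X, y, C=1.0, kernel="linear"):
    """For each observation i, refit without i and return the mean
    squared change in decision values over the full sample: a crude
    analogue of Cook's distance."""
    full = SVC(C=C, kernel=kernel).fit(X, y)
    f_full = full.decision_function(X)
    n = len(y)
    influence = np.empty(n)
    for i in range(n):
        keep = np.delete(np.arange(n), i)
        f_i = SVC(C=C, kernel=kernel).fit(X[keep], y[keep]).decision_function(X)
        influence[i] = np.mean((f_full - f_i) ** 2)
    return influence

# Toy example: two Gaussian clusters with one deliberately flipped label.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2.0, 1.0, (30, 2)), rng.normal(2.0, 1.0, (30, 2))])
y = np.array([-1] * 30 + [1] * 30)
y[0] = 1  # plant an outlier; it should receive a large influence value
print(np.argmax(deletion_influence(X, y)))  # typically prints 0
```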

Electronic supplementary material The online version of this article (https://doi.org/10.1007/s42952-019-00037-5) contains supplementary material, which is available to authorized users.

Corresponding author: Choongrak Kim, [email protected]

¹ Cardiovascular Center, Seoul National University Bundang Hospital, Seongnam, South Korea
² Department of Statistics, Pusan National University, Busan, South Korea

1 Introduction

Support vector machines (SVM) are a very efficient and popular tool for classification. The idea of the perceptron (Rosenblatt 1958) was extended to the optimal separating hyperplane by Vapnik and Chervonenkis (1964), and the name "support vector" was used explicitly for the first time by Cortes and Vapnik (1995). SVM has many useful properties (see, e.g., Vapnik 1998), and two important aspects deserve mention. First, SVM can handle high-dimensional data, and its prediction results are very reliable; in high-dimensional settings, SVM has been applied in various fields such as face identification and recognition (Guo et al. 2000), text categorization (Siolas and d'Alché-Buc 2000), and biological and medical aid (Guyon et al. 2002; Pochet et al. 2004; Akay 2009). Second, SVM handles possible nonlinearity inherent in the data by incorporating kernel functions (Schölkopf and Smola 2002). Moguerza and Muñoz (2006) is a good review paper, and Vapnik (1998) is an excellent book, among others. One potential drawback of SVM is its non-robustness to outliers; that is, SVM can be more sensitive to outliers than other classifiers because the optimal separating hyperplane obtained by SVM is determined solely by the so-called support vectors, and the number of support vectors is usually much smaller than the sample size. Therefore, the identification of outliers is a very important issue in SVM.
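The claim that the hyperplane depends only on the support vectors is easy to check numerically. The sketch below, again assuming scikit-learn's SVC, deletes an observation that is not a support vector and verifies that the refitted coefficients are essentially unchanged; deleting a support vector would generally move the hyperplane.

```python
# A minimal numerical check that the fitted hyperplane is determined
# solely by the support vectors: deleting a non-support vector leaves
# the coefficients essentially unchanged (up to solver tolerance).
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-2.0, 1.0, (40, 2)), rng.normal(2.0, 1.0, (40, 2))])
y = np.array([-1] * 40 + [1] * 40)

full = SVC(kernel="linear", C=1.0).fit(X, y)
sv = set(full.support_)  # indices of the support vectors
non_sv = next(i for i in range(len(y)) if i not in sv)

keep = np.delete(np.arange(len(y)), non_sv)
refit = SVC(kernel="linear", C=1.0).fit(X[keep], y[keep])

# Both differences should be near zero; deleting a support vector
# instead would generally change the fitted hyperplane.
print(np.abs(full.coef_ - refit.coef_).max())
print(np.abs(full.intercept_ - refit.intercept_).max())
```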