Improving the Accuracy of the kNN Method When Using an Even Number k of Neighbors



Abstract

The kNN (k-Nearest Neighbors) method is a classification method that can show low accuracy for even values of k. This paper details a method to improve the accuracy of the kNN method in those cases. It also presents a method that can improve its accuracy for biased classification sets and for odd values of k.

1 Introduction

There are many machine learning methods for data classification [1], and several of them have been used for breast cancer prognosis and diagnosis [2–5]. This paper deals with improving the kNN (k-Nearest Neighbors) method [6]. It is a nonparametric classification method with high accuracy, and it has even been used to detect different stages of breast cancer [7, 8]. We have conducted research on the details of its use for breast cancer diagnosis and prognosis [9, 10]. When using the kNN method for breast cancer prognosis we found that its accuracy is low when the number of neighbors k is small and even. When k takes an even value there are cases where the data used for classification split into equal-sized groups, so the class is chosen not by majority but by the order of the groups used to determine the class. To improve the accuracy in those cases we have devised a method that is very effective for small even values of k; it is detailed in the following section. In Section 3 we show the results of applying it to the breast cancer prognosis data of the UCI repository [11].
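To make the tie situation concrete, the following minimal sketch (in Python, with hypothetical class labels) shows how a plain majority vote over the k nearest labels fails to produce a winner when k is even and the neighbors split evenly. This only illustrates the problem; the paper's remedy is described in the following section.

```python
from collections import Counter

def majority_vote(labels):
    """Return the majority class among the k nearest labels,
    or None when the top classes tie (possible whenever k is even)."""
    counts = Counter(labels).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None  # tie: the k neighbors split into equal-sized groups
    return counts[0][0]

# Hypothetical labels: with k = 4 the neighbors can split 2-2,
# so no class wins by majority.
print(majority_vote(["recur", "recur", "no-recur", "no-recur"]))  # None
print(majority_vote(["recur", "no-recur", "recur", "recur"]))     # recur
```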

When evaluating the kNN method the target data are split into two sets: one is used for classification and the other as a test set. In a typical implementation the test set is chosen at random, and this way of forming it can produce classification sets that contain data of only one class. Such sets completely bias the classification process and also lower the average accuracy measured for the method. We therefore also researched an implementation that helps avoid these cases. We show some details of this implementation and of its combination with the method that improves the accuracy for even values of k.
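One common way to avoid single-class classification sets is stratified sampling, which draws the test set from each class separately so that both sets contain every class. The sketch below illustrates this generic technique; it is not necessarily the exact implementation the authors used, and the function and variable names are illustrative.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_fraction=0.3, seed=0):
    """Return (classification_idx, test_idx) index lists such that
    every class contributes to both sets in roughly the same
    proportion as in the full data set."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    classification_idx, test_idx = [], []
    for idx in by_class.values():
        rng.shuffle(idx)
        # at least one datum of each class goes to the test set
        cut = max(1, int(len(idx) * test_fraction))
        test_idx.extend(idx[:cut])
        classification_idx.extend(idx[cut:])
    return classification_idx, test_idx
```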

2 The kNN Method

The k-Nearest Neighbors (kNN) method is a supervised nonparametric machine learning method used for classification tasks. To evaluate its accuracy, and that of other classification algorithms, we usually divide an already classified data set into two sets: one is used for the classification task and the other as a test set. Then one datum at a time is taken from the test set and compared with the data in the classification set. In Fig. 1 we show an example where an unclassified datum is brought into the classification set and its similarity to the surrounding data is measured. Similarity is usually measured with the Euclidean distance between data, but other distances can also be used.
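As a concrete illustration of this procedure, the following sketch classifies a single test datum against the classification set using the Euclidean distance and a plain majority vote. The names are illustrative, and the vote shown here is the basic scheme whose even-k ties the paper's method addresses, not the improved method itself.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two numeric feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_classify(classification_X, classification_y, x, k):
    """Classify one test datum x by a majority vote among its
    k nearest neighbors in the classification set."""
    ranked = sorted(zip(classification_X, classification_y),
                    key=lambda pair: euclidean(pair[0], x))
    top = [y for _, y in ranked[:k]]
    # For even k this vote can tie, and the winner then depends on
    # an arbitrary ordering -- the case the paper's method improves.
    return max(set(top), key=top.count)
```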