Towards enriching the quality of k -nearest neighbor rule for document classification
- PDF / 493,652 Bytes
- 9 Pages / 595.276 x 790.866 pts Page_size
- 97 Downloads / 179 Views
ORIGINAL ARTICLE
Towards enriching the quality of k-nearest neighbor rule for document classification Tanmay Basu • C. A. Murthy
Received: 26 September 2012 / Accepted: 23 May 2013 Ó Springer-Verlag Berlin Heidelberg 2013
Abstract The k-nearest neighbor rule is a simple and effective classifier for document classification. In this method, a document is put into a particular class if the class has the maximum representation among the k nearest neighbors of the documents in the training set. The k nearest neighbors of a test document are ordered based on their content similarity with the documents in the training set. Document classification is very challenging due to the large number of attributes present in the data set. Many attributes, due to the sparsity of the data, do not provide any information about a particular document. Thus, assigning a document to a predefined class for a large value of k may not be accurate when the margin of majority voting is one or when a tie occurs. This article tweaks the knn rule by putting a threshold on the majority voting and the method proposes a discrimination criterion to prune the actual search space of the test document. The proposed classification rule will enhance the confidence of the voting process and it makes no prior assumption about the number of nearest neighbors. The experimental evaluation using various well known text data sets show that the accuracy of the proposed method is significantly better than the traditional knn method as well as some other document classification methods. Keywords
k-nearest neighbor Text classification
T. Basu (&) C. A. Murthy Machine Intelligence Unit, Indian Statistical Institute, Kolkata, India e-mail: [email protected] C. A. Murthy e-mail: [email protected]
1 Introduction The k-nearest neighbor (knn) rule is a commonly used, robust and simple classification technique for document classification [7]. The task of knn rule is to assign a test data point x to a particular class using a training sample set. It first finds the k-nearest neighbors from the training sample set using some distance function and assigns x to a particular class by taking a majority vote among the k-nearest neighbors. The performance of the nearest neighbor classification rule depends heavily upon the value of the neighborhood parameter k. Different values of k can change the classification result and hence choice of k is crucial for proper classification. The cross validation technique is generally used to estimate an optimal value of k [3]. But choosing an optimal k which provides satisfactory results for all test data points is still a difficult job. The cross-validation method uses the training data to select a single value of k, and then that selected value is used for classifying all observations. In knn rule we may put a point into a class which has a win by one vote to the next competing class. A point may also be arbitrarily assigned to a class if there is a tie between two competing classes i.e., if the number of members of the com
Data Loading...