Towards enriching the quality of k-nearest neighbor rule for document classification


ORIGINAL ARTICLE

Tanmay Basu • C. A. Murthy

Received: 26 September 2012 / Accepted: 23 May 2013
© Springer-Verlag Berlin Heidelberg 2013

Abstract The k-nearest neighbor rule is a simple and effective classifier for document classification. In this method, a test document is assigned to a class if that class has the maximum representation among the document's k nearest neighbors in the training set. The k nearest neighbors of a test document are ranked by the content similarity between the test document and the documents in the training set. Document classification is very challenging due to the large number of attributes present in the data set. Many attributes, due to the sparsity of the data, do not provide any information about a particular document. Thus, assigning a document to a predefined class for a large value of k may not be accurate when the margin of the majority vote is one or when a tie occurs. This article modifies the knn rule by putting a threshold on the majority voting, and it proposes a discrimination criterion to prune the actual search space of the test document. The proposed classification rule enhances the confidence of the voting process and makes no prior assumption about the number of nearest neighbors. The experimental evaluation on various well-known text data sets shows that the accuracy of the proposed method is significantly better than that of the traditional knn method as well as some other document classification methods.

Keywords k-nearest neighbor • Text classification

T. Basu (✉) • C. A. Murthy
Machine Intelligence Unit, Indian Statistical Institute, Kolkata, India
e-mail: [email protected]

C. A. Murthy
e-mail: [email protected]

1 Introduction

The k-nearest neighbor (knn) rule is a commonly used, robust and simple technique for document classification [7]. The task of the knn rule is to assign a test data point x to a particular class using a training sample set. It first finds the k nearest neighbors of x in the training sample set using some distance function, and then assigns x to a class by taking a majority vote among those k neighbors. The performance of the nearest neighbor classification rule depends heavily upon the value of the neighborhood parameter k. Different values of k can change the classification result, and hence the choice of k is crucial for proper classification. The cross-validation technique is generally used to estimate an optimal value of k [3]. But choosing an optimal k which provides satisfactory results for all test data points is still a difficult job. The cross-validation method uses the training data to select a single value of k, and that selected value is then used for classifying all observations. In the knn rule, a point may be put into a class that wins by only one vote over the next competing class. A point may also be arbitrarily assigned to a class if there is a tie between two competing classes, i.e., if the number of members of the com
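As an illustration of the voting step described above, the following minimal Python sketch ranks training documents by similarity to a test document, takes a majority vote among the k nearest ones, and reports the margin between the winning class and the runner-up. It is a sketch of the traditional knn vote, not the authors' proposed rule. The cosine similarity on term-weight dictionaries, the function names `cosine_similarity` and `knn_vote`, and the `min_margin` parameter are assumptions introduced here for illustration only.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine similarity between two sparse term-weight dicts (assumed representation)."""
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def knn_vote(test_doc, training_set, k, min_margin=1):
    """Majority vote among the k most similar training documents.

    training_set is a list of (term_weight_dict, label) pairs.
    Returns (label, margin), where margin is the vote difference between
    the winning class and the runner-up. The label is None when the
    margin is below min_margin (a hypothetical threshold), which covers
    ties (margin 0) and, for min_margin=2, one-vote wins.
    """
    if k < 1 or not training_set:
        raise ValueError("k must be >= 1 and training_set must be non-empty")
    # Rank training documents from most to least similar to the test document.
    ranked = sorted(
        training_set,
        key=lambda item: cosine_similarity(test_doc, item[0]),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    top = votes.most_common(2)
    best_label, best_count = top[0]
    runner_up_count = top[1][1] if len(top) > 1 else 0
    margin = best_count - runner_up_count
    if margin < min_margin:
        return None, margin
    return best_label, margin
```

With a small toy training set, a call such as `knn_vote(test_doc, training_set, k=3, min_margin=2)` returns a label of `None` whenever the winning class leads by a single vote, which makes the weak-majority cases discussed in the text explicit instead of resolving them arbitrarily.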
