Text Document Classification with PCA and One-Class SVM
Abstract We propose a document classifier based on principal component analysis (PCA) and a one-class support vector machine (OCSVM), where PCA performs dimensionality reduction and OCSVM performs classification. First, PCA is applied to the document-term matrix and the top few principal components are retained. Next, OCSVM is trained on the records of the matrix corresponding to the negative class. The trained OCSVM is then tested on the records of the matrix corresponding to the positive class. The effectiveness of the proposed model is demonstrated on popular datasets, viz., 20NG, malware, and Syskill & Webert, as well as customer feedback from a bank. We observed that the hybrid yielded very high accuracies on all datasets.
Keywords Text mining ⋅ Dimensionality reduction ⋅ Document classification ⋅ Principal component analysis ⋅ One-class support vector machine
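The pipeline outlined in the abstract (PCA on the document-term matrix, OCSVM trained on the negative class, then tested on the positive class) could be sketched roughly as follows. This is only an illustrative sketch and not the authors' implementation: it assumes scikit-learn, and the toy corpus, the labels, the number of components, and the OCSVM parameters (`kernel`, `gamma`, `nu`) are all hypothetical choices made for the example.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import OneClassSVM

# Hypothetical toy corpus (not from the paper): 0 = negative class, 1 = positive class.
docs = [
    "the bank resolved my complaint quickly",
    "the branch staff were helpful and polite",
    "my account statement arrived on time",
    "the loan process was smooth and quick",
    "the mobile app works well for payments",
    "the card was delivered without delay",
    "malicious binary drops a trojan payload",
    "the worm exploits a buffer overflow to spread",
]
labels = np.array([0, 0, 0, 0, 0, 0, 1, 1])

# Build the document-term matrix (documents x terms); PCA needs a dense array.
dtm = CountVectorizer(stop_words="english").fit_transform(docs).toarray()

# Keep the top few principal components (2 is an arbitrary choice here).
pca = PCA(n_components=2, random_state=0)
reduced = pca.fit_transform(dtm)

# Train the one-class SVM on the negative-class records only.
ocsvm = OneClassSVM(kernel="rbf", gamma="scale", nu=0.1)
ocsvm.fit(reduced[labels == 0])

# Test on the positive-class records: a prediction of -1 (outlier) is
# read as "positive", +1 (inlier) as "negative".
pred = ocsvm.predict(reduced[labels == 1])
detected = int(np.sum(pred == -1))
print(f"{detected} of {pred.size} positive documents flagged as outliers")
```

On a corpus this small the detection count carries no meaning; the sketch only shows how the stages connect. Fitting PCA on the full matrix follows the order of steps given in the abstract.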
B. Shravan Kumar ⋅ V. Ravi (✉)
Centre of Excellence in Analytics, Institute for Development and Research in Banking Technology, Castle Hills Road No. 1, Masab Tank, Hyderabad 500057, India
B. Shravan Kumar
School of Computer & Information Sciences, University of Hyderabad, Hyderabad 500046, India
© Springer Nature Singapore Pte Ltd. 2017. S.C. Satapathy et al. (eds.), Proceedings of the 5th International Conference on Frontiers in Intelligent Computing: Theory and Applications, Advances in Intelligent Systems and Computing 515, DOI 10.1007/978-981-10-3153-3_11
1 Introduction Text document classification is the task of assigning text documents to predefined classes. Statistical and machine learning techniques cannot analyze text documents directly, since text data is unstructured. Therefore, the unstructured data must be converted into a structured form before any classifier is invoked. Text classification is fraught with challenges, including the high dimensionality of the feature space, where each unique word represents a feature [1]. It is sometimes also essential to reduce the dimension of the input (document) space, because documents can be sparse with respect to the features when mapped into a structured format. In this paper, our objective is to reduce the feature space dimension without compromising the performance of a classifier. According to Dorre et al. [2], text mining extracts implicit knowledge from text documents. The first step in text mining is to transform the text corpus into a document-term matrix. This requires preprocessing the text, including tokenization, stop-word removal, and stemming [3]. Once the document-term matrix is formed, data mining techniques are applied to the matrix to solve the underlying problem. Given the high dimensionality of the data, feature selection and/or dimensionality reduction is performed before invoking classifiers. Our research proposes a new method for document classification by performing dimensionality reduction with PCA followed by classifying the resultant matr
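The preprocessing steps named in the introduction (tokenization, stop-word removal, and stemming, followed by building the document-term matrix) might look like the minimal sketch below. It uses only the Python standard library. The stop-word list, the suffix-stripping `crude_stem` function (a rough stand-in for a real stemmer such as Porter's), and the two sample documents are assumptions made for illustration and do not come from the paper.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"the", "a", "an", "is", "are", "of", "and", "to", "in", "on"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def crude_stem(token):
    """Strip a few common suffixes; a crude stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        # Only strip when at least three characters of the token remain.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Tokenize, drop stop words, and stem the remaining tokens."""
    return [crude_stem(t) for t in tokenize(text) if t not in STOP_WORDS]


def document_term_matrix(docs):
    """Return (vocabulary, matrix), where matrix[i][j] counts term j in document i."""
    counts = [Counter(preprocess(d)) for d in docs]
    vocabulary = sorted(set().union(*counts))
    matrix = [[c[term] for term in vocabulary] for c in counts]
    return vocabulary, matrix


docs = [
    "The classifier is classifying documents",
    "Documents are mapped to terms",
]
vocabulary, matrix = document_term_matrix(docs)
print(vocabulary)  # ['classifier', 'classify', 'document', 'mapp', 'term']
for row in matrix:
    print(row)     # [1, 1, 1, 0, 0] then [0, 0, 1, 1, 1]
```

Each row is one document and each column one stemmed term, which is why the feature space grows with the vocabulary and quickly becomes sparse and high-dimensional.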