Text Document Classification with PCA and One-Class SVM


Abstract We propose a document classifier based on principal component analysis (PCA) and a one-class support vector machine (OCSVM), where PCA performs dimensionality reduction and OCSVM performs classification. First, PCA is applied to the document-term matrix and the top few principal components are retained. Next, OCSVM is trained on the records of the matrix corresponding to the negative class. The trained OCSVM is then tested on the records of the matrix corresponding to the positive class. The effectiveness of the proposed model is demonstrated on popular datasets, viz., 20NG, malware, and Syskill & Webert, as well as on customer feedback from a bank. We observed that the hybrid yielded very high accuracies on all datasets.
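The pipeline summarized in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes scikit-learn and NumPy, uses a synthetic count matrix in place of a real document-term matrix, and the values `n_components=5`, `nu=0.05`, and the RBF kernel are illustrative choices that do not come from the paper.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)

# Synthetic stand-in for a document-term matrix (rows = documents, columns = terms).
# Assumption: negative-class documents mostly use the first 10 terms,
# positive-class documents mostly use the last 10 terms.
n_terms = 50
neg_rates = np.r_[np.full(10, 5.0), np.full(40, 0.1)]
pos_rates = np.r_[np.full(40, 0.1), np.full(10, 5.0)]
X_neg = rng.poisson(neg_rates, size=(200, n_terms)).astype(float)
X_pos = rng.poisson(pos_rates, size=(50, n_terms)).astype(float)

# Step 1: apply PCA to the whole matrix and keep the top few components.
X_all = np.vstack([X_neg, X_pos])
pca = PCA(n_components=5)  # illustrative value
Z_all = pca.fit_transform(X_all)
Z_neg, Z_pos = Z_all[: len(X_neg)], Z_all[len(X_neg):]

# Step 2: train the one-class SVM on negative-class records only.
ocsvm = OneClassSVM(kernel="rbf", gamma="scale", nu=0.05)  # illustrative settings
ocsvm.fit(Z_neg)

# Step 3: test on positive-class records.
# predict() returns +1 for inliers (negative class) and -1 for outliers.
pos_pred = ocsvm.predict(Z_pos)
detection_rate = float(np.mean(pos_pred == -1))
train_inlier_rate = float(np.mean(ocsvm.predict(Z_neg) == 1))

print(f"positive documents flagged as outliers: {detection_rate:.2f}")
print(f"negative training documents accepted:   {train_inlier_rate:.2f}")
```

In this sketch a positive-class document counts as correctly classified when the model rejects it as an outlier, which matches the train-on-negative, test-on-positive protocol described above.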

Keywords Text mining ⋅ Dimensionality reduction ⋅ Document classification ⋅ Principal component analysis ⋅ One-class support vector machine

B. Shravan Kumar ⋅ V. Ravi (✉)
Centre of Excellence in Analytics, Institute for Development and Research in Banking Technology, Castle Hills Road No. 1, Masab Tank, Hyderabad 500057, India

B. Shravan Kumar
School of Computer & Information Sciences, University of Hyderabad, Hyderabad 500046, India

© Springer Nature Singapore Pte Ltd. 2017
S.C. Satapathy et al. (eds.), Proceedings of the 5th International Conference on Frontiers in Intelligent Computing: Theory and Applications, Advances in Intelligent Systems and Computing 515, DOI 10.1007/978-981-10-3153-3_11

1 Introduction Text document classification is defined as the task of assigning text documents to predefined classes. Statistical and machine learning techniques cannot analyze text documents directly, since text data is in an unstructured format. Therefore, the unstructured data must be converted into a structured form before any classifier is invoked. Text classification is fraught with challenges, including the high dimensionality of the feature space, where each unique word represents a feature [1]. Sometimes it is also essential to reduce the dimension of the input (document) space, because documents can be sparse with respect to the features when mapped into a structured format. In this paper, our objective is to reduce the feature space dimension without compromising the performance of a classifier.

According to Dorre et al. [2], text mining extracts the implicit knowledge from text documents. The first step in text mining is to transform the text corpus into a document-term matrix. This requires preprocessing of the text, including tokenization, stop-word removal, and stemming [3]. Once the document-term matrix is formed, data mining techniques are applied to the matrix to solve the underlying problem. Given the high dimensionality of the data, feature selection and/or dimensionality reduction is performed before invoking classifiers. Our research proposes a new method for document classification by performing dimensionality reduction with PCA followed by classifying the resultant matr
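To make the preprocessing steps named above concrete, the following sketch builds a small document-term matrix using only the Python standard library. It is an illustration under stated assumptions rather than the procedure used in the paper: the stop-word list is a tiny hypothetical one, and `crude_stem` is a simplistic suffix stripper standing in for a real stemmer such as Porter's.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list; real pipelines use a fuller one.
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "of", "the", "to"}


def crude_stem(token):
    """Strip a few common suffixes (a stand-in for a real stemming algorithm)."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Tokenize on alphabetic runs, drop stop words, then stem."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]


def document_term_matrix(documents):
    """Return (vocabulary, matrix); matrix[i][j] counts term j in document i."""
    tokenized = [preprocess(doc) for doc in documents]
    vocabulary = sorted({t for tokens in tokenized for t in tokens})
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        matrix.append([counts[term] for term in vocabulary])
    return vocabulary, matrix


docs = [
    "The classifier is classifying documents",
    "Documents are mapped to features",
]
vocab, dtm = document_term_matrix(docs)
print(vocab)
for row in dtm:
    print(row)
```

Each unique stemmed word becomes one column, so the matrix grows wide and sparse as the corpus grows, which is the high-dimensionality problem that motivates a reduction step before classification.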
