A novel semi supervised approach for text classification
ORIGINAL RESEARCH
Debaditya Barman¹ • Nirmalya Chowdhury²
Received: 25 December 2017 / Accepted: 2 April 2018
© Bharati Vidyapeeth's Institute of Computer Applications and Management 2018
Abstract Text categorization, also known as text classification, is a supervised classification problem. It aims to assign a predefined class label to a new or unknown text document. Training a classifier usually requires a large collection of labelled data from each class, but labelled text data is hard or expensive to collect. In most cases the labels are assigned manually, which is neither cost-effective nor efficient. In this paper, we introduce a semi-supervised classification approach in which the learner needs only a very small amount of labelled data, together with a large amount of unlabelled data, to assign a class label to a new or unknown text document. The proposed method uses a Kohonen self-organizing map (SOM) to label the unlabelled data, and three classifiers, namely support vector machine (SVM), Naïve Bayes (NB), and decision tree (DT; specifically the classification and regression tree, CART), to measure classification accuracy. The experimental results show the effectiveness of the proposed method.
Keywords Text categorization · Semi-supervised learning · Kohonen self-organizing map · Naïve Bayes · Decision tree · Support vector machine
Nirmalya Chowdhury [email protected] · Debaditya Barman [email protected]
1 Department of Computer and System Sciences, Visva-Bharati, Santiniketan 731235, India
2 Department of Computer Science and Engineering, Jadavpur University, Kolkata 700032, India
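The pipeline summarized in the abstract (a SOM labels the unlabelled documents, then a conventional classifier is trained on the enlarged labelled set) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how SOM units are mapped to class labels, so the sketch assumes a majority vote of the labelled documents that fall on each unit. The farthest-point initialization, the decay schedules, the hand-written multinomial Naive Bayes, the toy vocabulary, and all function names are likewise assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict

def dist2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def bmu(units, v):
    """Index of the best-matching unit (closest weight vector) for v."""
    return min(range(len(units)), key=lambda i: dist2(units[i], v))

def train_som(data, n_units=2, epochs=20, lr0=0.5, sigma0=0.5):
    """Train a 1-D SOM; units start at mutually distant data points (assumed init)."""
    units = [list(data[0])]
    while len(units) < n_units:
        far = max(data, key=lambda v: min(dist2(u, v) for u in units))
        units.append(list(far))
    for epoch in range(epochs):
        frac = epoch / epochs
        lr = lr0 * (1 - frac)                    # learning rate decays linearly
        sigma = max(sigma0 * (1 - frac), 0.05)   # neighbourhood width shrinks
        for v in data:
            b = bmu(units, v)
            for i in range(n_units):
                # Gaussian neighbourhood on the 1-D grid of unit indices
                h = math.exp(-((i - b) ** 2) / (2 * sigma ** 2))
                units[i] = [w + lr * h * (x - w) for w, x in zip(units[i], v)]
    return units

def propagate_labels(units, labelled, unlabelled):
    """Label each unit by majority vote of labelled docs, then label the rest."""
    votes = defaultdict(Counter)
    for v, y in labelled:
        votes[bmu(units, v)][y] += 1
    unit_label = {i: c.most_common(1)[0][0] for i, c in votes.items()}
    # Documents that land on a unit with no labelled vote stay unlabelled.
    return [(v, unit_label[bmu(units, v)]) for v in unlabelled
            if bmu(units, v) in unit_label]

def train_nb(docs):
    """Multinomial Naive Bayes with Laplace smoothing over term-count vectors."""
    n_terms = len(docs[0][0])
    class_docs = Counter(y for _, y in docs)
    term_counts = {y: [0] * n_terms for y in class_docs}
    for v, y in docs:
        term_counts[y] = [t + x for t, x in zip(term_counts[y], v)]
    model = {}
    for y, counts in term_counts.items():
        total = sum(counts) + n_terms
        model[y] = (math.log(class_docs[y] / len(docs)),
                    [math.log((c + 1) / total) for c in counts])
    return model

def predict_nb(model, v):
    """Return the class with the highest log-posterior for count vector v."""
    return max(model, key=lambda y: model[y][0]
               + sum(x * lp for x, lp in zip(v, model[y][1])))

# Toy term counts over a hypothetical vocabulary: [goal, match, stock, market]
labelled = [([3, 2, 0, 0], "sports"), ([0, 0, 3, 2], "finance")]
unlabelled = [[2, 3, 0, 0], [3, 3, 0, 0], [0, 0, 2, 3], [0, 0, 3, 3]]

units = train_som([v for v, _ in labelled] + unlabelled)
pseudo = propagate_labels(units, labelled, unlabelled)   # SOM-assigned labels
model = train_nb(labelled + pseudo)                      # train on enlarged set
prediction = predict_nb(model, [1, 2, 0, 1])
print([y for _, y in pseudo], prediction)
```

Only two documents carry labels here. The SOM groups the remaining four with them, and the classifier is then trained on all six. In practice the final classifier could equally be an SVM or a CART tree, as the abstract mentions, and the documents would be real term-frequency or TF-IDF vectors rather than hand-made counts.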
1 Introduction
The growth of social networking, blogging, and micro-blogging has created opportunities to develop a variety of applications in natural language processing (NLP). Text classification is a very popular research area within NLP. Over the last decade it has grown exponentially [1–6] owing to the easy accessibility of digital text documents.
Text classification means the automated assignment of textual data to predefined classes. Sometimes either the number of classes or the class labels are not known. In that case, a clustering technique is first employed to obtain an appropriate grouping of the given set of text documents, and the resulting groups are then labelled on the basis of some criterion or heuristic.
Several machine learning algorithms have been applied successfully to categorize text documents based on their content. The Naive Bayes (NB) algorithm is perhaps the most frequently used classifier for the text categorization problem. Researchers have used two different generative models when designing the NB classifier: the multivariate Bernoulli [7–11] and the multinomial [12–16] event models. Algorithms based on artificial neural networks (ANN) [17–19], decision trees (DT) [20–23], k-nearest neighbor (KNN) [24–27], and SVM [28, 29] have also been employed frequently to solve the text categorization problem. Ruiz and Srinivasan [5] proposed a method based on princi