Applying Machine Learning Algorithms for News Articles Categorization: Using SVM and kNN with TF-IDF Approach




Abstract News article categorization is a supervised learning task in which news articles are assigned category labels according to the likelihood indicated by a training set of labeled articles. A system for automatically categorizing news articles into a standard set of categories has been implemented. The proposed work uses the Term Frequency–Inverse Document Frequency (TF-IDF) term weighting scheme to improve classification results, applies two supervised learning approaches, Support Vector Machine (SVM) and K-Nearest Neighbor (kNN), and compares the performance of the two classifiers. Each news document is preprocessed and transformed into a term-document matrix (Tsoumakas et al. in Data mining and knowledge discovery handbook. Springer, Berlin, pp 667–685 (2010) [1]). After each news article is preprocessed and transformed into a vector, the TF-IDF scheme is used to weight its words. TF-IDF weights a word by how often it appears in a document, discounted by how many documents in the collection contain it. An unknown news item is likewise transformed into a vector of keyword weights and then assigned to a suitable category such as Sports, Business, or Science and Technology. The proposed system was trained on a collection of approximately 300 categorized news articles extracted from various Indian newspaper websites and tested on a separate set of 60 randomly extracted news items from the same sources (Trstenjak et al. in Proc Eng 69:1356–1364 (2014) [2], Buana et al. in Int J Comput Appl 50:37–42 [3]). Both algorithms were observed to perform better when the TF-IDF approach is used.
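The TF-IDF weighting mentioned in the abstract can be illustrated with a minimal sketch. It assumes the common textbook form, tf × log(N / df), and a whitespace tokenizer; the paper does not state which TF-IDF variant or preprocessing it uses, so the function names `tokenize` and `tfidf_vectors` and the three-sentence corpus are purely hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a stand-in for real preprocessing)."""
    return text.lower().split()


def tfidf_vectors(documents):
    """Return one {term: weight} dict per document, using tf * log(N / df).

    Assumption: tf is the term count divided by document length, and idf is
    the unsmoothed natural log of N / df. Other variants exist.
    """
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        vectors.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors


# Hypothetical toy corpus, not taken from the paper's data set.
corpus = [
    "india wins the cricket match",
    "the stock market falls",
    "scientists launch the new satellite",
]
weights = tfidf_vectors(corpus)
for doc, vec in zip(corpus, weights):
    print(doc, "->", vec)
```

In this sketch a word that occurs in every document, such as "the", receives weight zero, while a word confined to one document keeps a positive weight. That is the property that makes TF-IDF useful for separating categories.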





Keywords News articles · K-Nearest neighbor (kNN) · Support vector machine (SVM) · Term frequency–inverse document frequency (TF-IDF)



Kanika (✉) · Sangeeta
I. K. Gujral Punjab Technical University, Kapurthala, India
e-mail: [email protected]
Sangeeta
e-mail: [email protected]

© Springer Nature Singapore Pte Ltd. 2019
A. K. Luhach et al. (eds.), Smart Computational Strategies: Theoretical and Practical Aspects, https://doi.org/10.1007/978-981-13-6295-8_9


1 Introduction Automated news article categorization, or classification, is the automatic assignment of news articles or documents to predefined categories or classes. It is an application of text classification and a supervised learning approach. Text classification is a Natural Language Processing (NLP) task: it learns a relational model (the classifier) between a text's attributes (features) and its category from a labeled training corpus, and then uses that classifier to classify new texts. Text classification can be divided into two parts: training and classifying. The purpose of training is to construct a classifier from the relationship between training texts and their categories, so that it can be used to classify new texts. Classifying means assigning a known category label to an unknown new text [4].
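The two phases described above, training and classifying, can be sketched with a small k-nearest-neighbor text classifier. This is an illustration only, not the authors' implementation: it assumes raw term counts instead of TF-IDF weights, cosine similarity as the closeness measure, and majority voting among the k neighbors. The class name `KNNTextClassifier`, its methods, and the four training sentences are hypothetical.

```python
import math
from collections import Counter


def to_vector(text):
    """Bag-of-words term counts (assumption: whitespace tokenization)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class KNNTextClassifier:
    """Hypothetical minimal kNN classifier with separate train/classify steps."""

    def __init__(self, k=3):
        self.k = k
        self.examples = []

    def train(self, texts, labels):
        # Training for kNN simply stores the labeled vectors.
        self.examples = [(to_vector(t), l) for t, l in zip(texts, labels)]

    def classify(self, text):
        # Rank stored examples by similarity to the new text, then take a
        # majority vote over the k most similar ones.
        query = to_vector(text)
        ranked = sorted(self.examples,
                        key=lambda ex: cosine(query, ex[0]),
                        reverse=True)
        votes = Counter(label for _, label in ranked[:self.k])
        return votes.most_common(1)[0][0]


# Hypothetical labeled training corpus.
train_texts = [
    "india wins the cricket match",
    "the team scores a late goal",
    "the stock market falls sharply",
    "bank shares rise as market recovers",
]
train_labels = ["Sports", "Sports", "Business", "Business"]

classifier = KNNTextClassifier(k=3)
classifier.train(train_texts, train_labels)
print(classifier.classify("cricket team wins"))
print(classifier.classify("market shares fall"))
```

For kNN the training phase only stores the labeled examples, and all the work happens at classification time. An SVM, by contrast, fits a decision boundary during training and evaluates it for each new text.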

2 Related Work Tr