Feature selection based on term frequency deviation rate for text classification

Hongfang Zhou¹ · Yiming Ma¹ · Xiang Li¹

Accepted: 11 September 2020
© Springer Science+Business Media, LLC, part of Springer Nature 2020

Abstract Feature selection is a technique for selecting a subset of the most relevant features for model training. In this paper, a new concept, the term frequency deviation rate (TDR), is first proposed to improve classification accuracy. Then, a TDR-based feature selection algorithm for text classification is presented. Finally, extensive experiments are conducted on seven datasets (K1a, K1b, WAP, R52, R8, 20 Newsgroups, and Cade12) with two classifiers, Naive Bayes and Support Vector Machine. The experimental results indicate that the new approach improves classification accuracy by 7.9% on average.

Keywords Text classification · Feature selection · Term frequency · Document frequency · Deviation ratio

¹ School of Computer Science and Engineering, Xi’an University of Technology, No. 5 South Jinhua Road, Xi’an, Shaanxi, China

1 Introduction

With the advent of the “Web 3.0” era, digital information on the Internet has been increasing at an explosive speed. For instance, massive amounts of data are generated by various software every day. The number of blog posts on social platforms such as microblogs and Twitter reaches nearly ten million per minute, and the comments on a popular blog can quickly reach tens of thousands. In addition, large search engines such as Baidu and Google handle countless search requests every day [1]. Processing such a mass of digital data manually is not only inefficient but also unlikely to reach acceptable accuracy. Therefore, it is of great significance to classify text files using machine learning algorithms and information technologies in order to improve the efficiency and accuracy of decision-making [2].

By means of text classification, we can predict the class labels of documents [3, 4]. It has a wide range of applications, such as theme detection, data retrieval [5], spam mail detection [6, 7], stock price prediction, digital library systems, author certification [8], spam short message detection, web page classification [9], sentiment analysis [10, 11], early depression detection [12], and so on [13–15]. Besides, users’ browsing interests can be captured from their comments on social network sites [16], and the analysis of financial data can help economists explore the economic status and development trends [17]. Additionally, financial information can be extracted to provide a basis for decision-making [18]. Based on consumption records and users’ evaluations of products, their preferences can be obtained to guide market sales [19].

Text classification mainly consists of three stages: preprocessing, feature selection, and classification [4, 20]. Because text data are represented by high-dimensional feature vectors [21], it is inevitable for the occurrenc