Feature selection based on term frequency deviation rate for text classification


Hongfang Zhou¹ · Yiming Ma¹ · Xiang Li¹

Accepted: 11 September 2020
© Springer Science+Business Media, LLC, part of Springer Nature 2020

Abstract

Feature selection is a technique for selecting the subset of the most relevant features for model training. In this paper, the concept of term frequency deviation rate (TDR) is first proposed to improve classification accuracy. Then, a TDR-based algorithm for text classification is presented. Finally, extensive experiments are conducted on seven datasets (K1a, K1b, WAP, R52, R8, 20NewGroups, and Cade12) with two classifiers, Naive Bayes and Support Vector Machine. The experimental results indicate that the new approach improves classification accuracy by 7.9% on average.

Keywords Text classification · Feature selection · Term frequency · Document frequency · Deviation ratio

1 Introduction

With the advent of the “Web 3.0” era, digital information on the Internet has been growing explosively. For instance, massive amounts of data are generated by various software every day. The number of blog posts on social platforms such as micro-blog and Twitter reaches nearly ten million per minute, and the comments on a popular post can quickly reach tens of thousands. In addition, large search engines such as Baidu and Google handle countless search requests every day [1]. Processing such a mass of digital data manually is not only inefficient but also fails to achieve acceptable accuracy. Therefore, it is of great significance to classify text files using machine learning algorithms and information technologies in order to improve the efficiency and accuracy of decision-making [2]. By means of text classification, we can predict the class labels of documents [3, 4]. It has a wide range of applications, such as theme detection, data retrieval [5], spam mail detection [6, 7], stock price prediction, digital library systems, author certification [8], spam short message detection, web page classification [9], sentiment analysis [10, 11], early depression detection [12], and so on [13–15]. Besides, users’ browsing interests can be captured from their comments on social network sites [16], and the analysis of financial data can help economists explore economic status and development trends [17]. Additionally, financial information can be extracted to provide a basis for decision-making [18]. Based on consumption records and users’ evaluations of products, their preferences can be obtained for market sales [19].

Text classification mainly consists of three stages: preprocessing, feature selection, and classification [4, 20]. On account of the text data represented by high-dimensional feature vectors [21], it is inevitable for the occurrenc

¹ School of Computer Science and Engineering, Xi’an University of Technology, No. 5 South Jinhua Road, Xi’an, Shaanxi, China
