A comparative study of feature selection methods for binary text streams classification


ORIGINAL PAPER

Matheus Bernardelli de Moraes1 · Andre Leon Sampaio Gradvohl1

Received: 12 November 2019 / Accepted: 4 October 2020
© Springer-Verlag GmbH Germany, part of Springer Nature 2020

Abstract Text streams are continuous flows of high-dimensional text transmitted at high volume and high velocity. They are expected to be classified in real time, which is challenging because of the high dimensionality of the feature space. Applying feature selection algorithms is one way to reduce the feature space of text streams and improve the learning process. However, since text streams are potentially unbounded, their probability distribution is expected to change over time, a phenomenon known as concept drift. Concept drift affects the feature selection process through feature drift, in which the relevance of features also changes over time. This paper presents a comparative study of six feature selection methods for binary text stream classification, including in the presence of feature drift. We also propose the Online Feature Selection with Evolving Regularization (OFSER) algorithm, a modified version of the Online Feature Selection (OFS) algorithm that uses evolving regularization to dynamically penalize model complexity, reducing the impact of feature drift on the feature selection process. We conducted the experimental analysis on eleven real-world datasets commonly used for text classification. The OFSER algorithm showed F1-scores up to 12.92% higher than other algorithms in some cases. The results of the Iman and Davenport and Bergmann–Hommel tests show that the OFSER algorithm is statistically superior to the Information Gain and Extremal Feature Selection algorithms in terms of improving the base classifier's predictive power.

Keywords Text streams · Feature drift · Feature selection · Evolving regularization · Binary classification · Concept drift
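To make the idea of online feature selection with a regularization penalty more concrete, the following is a minimal sketch of an OFS-style truncated update for a linear binary classifier. It is an illustration under stated assumptions, not the authors' OFSER implementation: the function names, the default values of `eta` and `lam`, and the toy stream are all hypothetical, and this excerpt does not describe how OFSER evolves the regularization term, so `lam` is simply exposed as a per-step argument that a caller could vary over time.

```python
import numpy as np


def truncate(w, num_features):
    """Keep the num_features largest-magnitude weights and zero the rest."""
    if np.count_nonzero(w) <= num_features:
        return w
    out = np.zeros_like(w)
    keep = np.argsort(np.abs(w))[-num_features:]
    out[keep] = w[keep]
    return out


def ofs_step(w, x, y, num_features, eta=0.2, lam=0.01):
    """One OFS-style update for a label y in {-1, +1}.

    eta (learning rate) and lam (regularization strength) are
    illustrative defaults, not values taken from the paper.
    """
    if y * np.dot(w, x) <= 1.0:
        # Margin violated: shrink, take a gradient step, project onto
        # an L2 ball of radius 1/sqrt(lam), then keep the top features.
        w = (1.0 - lam * eta) * w + eta * y * x
        norm = np.linalg.norm(w)
        if norm > 0:
            w = min(1.0, (1.0 / np.sqrt(lam)) / norm) * w
        return truncate(w, num_features)
    # Margin satisfied: only apply the regularization shrinkage.
    return (1.0 - lam * eta) * w


# Toy stream (hypothetical data): 5 term features, keep at most 2.
w = np.zeros(5)
stream = [
    (np.array([1.0, 2.0, 0.0, 0.0, 3.0]), 1),
    (np.array([0.0, 1.0, 4.0, 0.0, 0.0]), -1),
]
for x, y in stream:
    w = ofs_step(w, x, y, num_features=2)
print(w)
```

The set of selected features is the set of non-zero entries of `w` after each step, so it can change as the stream evolves; a larger `lam` shrinks old weights faster, which is one plausible way a penalty could let stale features drop out under feature drift.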

Electronic supplementary material The online version of this article (https://doi.org/10.1007/s12530-020-09357-y) contains supplementary material, which is available to authorized users.

* Matheus Bernardelli de Moraes [email protected]
Andre Leon Sampaio Gradvohl [email protected]

1 School of Technology, University of Campinas, Limeira, Brazil

1 Introduction Classification of textual information is an important topic addressed by several research fields, such as sentiment analysis (Yue et al. 2019), spam detection (Méndez et al. 2006), and topic labeling and intent detection (Brenes et al. 2009), among others. Traditionally, systems performed this task in static environments, where the document is fully available for training and testing for an indefinite period. Also, there are no variations in the correlation between terms and the instance labels. One of the major challenges in text classification problems is the high dimensionality of the feature space. According to Yang and Pedersen, “feature space in text problems consists of unique terms (word or phrases) tha
