A comparative study of feature selection methods for binary text streams classification
ORIGINAL PAPER
Matheus Bernardelli de Moraes¹ · Andre Leon Sampaio Gradvohl¹
Received: 12 November 2019 / Accepted: 4 October 2020
© Springer-Verlag GmbH Germany, part of Springer Nature 2020
Abstract
Text streams are continuous flows of high-dimensional text, transmitted at high volume and high velocity. They are expected to be classified in real time, which is challenging because of the high dimensionality of the feature space. Applying feature selection algorithms is one way to reduce the feature space of text streams and improve the learning process. However, since text streams are potentially unbounded, their probability distribution is expected to change over time, a phenomenon known as concept drift. Concept drift affects the feature selection process through feature drift, in which the relevance of features is also subject to change over time. This paper presents a comparative study of six feature selection methods for binary text stream classification, even in the presence of feature drift. We also propose the Online Feature Selection with Evolving Regularization (OFSER) algorithm, a modified version of the Online Feature Selection (OFS) algorithm that uses evolving regularization to dynamically penalize model complexity, reducing the impact of feature drift on the feature selection process. We conducted the experimental analysis on eleven commonly used real-world datasets for text classification. The OFSER algorithm achieved F1-scores up to 12.92% higher than the other algorithms in some cases. The results of the Iman and Davenport and Bergmann–Hommel tests show that the OFSER algorithm is statistically superior to the Information Gain and Extremal Feature Selection algorithms in terms of improving the predictive power of the base classifier.

Keywords Text streams · Feature drift · Feature selection · Evolving regularization · Binary classification · Concept drift
Electronic supplementary material The online version of this article (https://doi.org/10.1007/s12530-020-09357-y) contains supplementary material, which is available to authorized users.

* Matheus Bernardelli de Moraes
Andre Leon Sampaio Gradvohl
¹ School of Technology, University of Campinas, Limeira, Brazil

1 Introduction
Classification of textual information is an important topic addressed by several research fields, such as sentiment analysis (Yue et al. 2019), spam detection (Méndez et al. 2006), and topic labeling and intent detection (Brenes et al. 2009), among others. Traditionally, systems performed this task in static environments, where the document is fully available for training and testing for an indefinite period. Also, there are no variations in the correlation between terms and the instance labels.

One of the major challenges in text classification problems is the high dimensionality of the feature space. According to Yang and Pedersen, “feature space in text problems consists of unique terms (word or phrases) tha