Kernelized fuzzy rough sets based online streaming feature selection for large-scale hierarchical classification
Shengxing Bai¹,² · Yaojin Lin¹,² · Yan Lv¹,² · Jinkun Chen³ · Chenxi Wang¹
© Springer Science+Business Media, LLC, part of Springer Nature 2020
Abstract
In recent years, many online streaming feature selection approaches have focused on flat data, meaning that all data are treated as a whole. However, in the era of big data, not only is the feature space unknown and evolving, but the label space also has a hierarchical structure. To address this problem, an online streaming feature selection framework for large-scale hierarchical classification tasks is proposed. The framework consists of three parts: (1) a new kernelized fuzzy rough model for hierarchical data is constructed using the sibling strategy, (2) important features are selected online based on feature correlation analysis, and (3) redundant features are deleted online based on feature redundancy. Finally, an empirical study on several hierarchical classification data sets shows that the proposed method outperforms other state-of-the-art online streaming feature selection methods.
Keywords Online feature selection · Hierarchical classification · Kernelized fuzzy rough sets · Sibling strategy
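The select-then-prune loop in parts (2) and (3) can be illustrated with a minimal sketch. This is not the authors' algorithm: absolute Pearson correlation stands in for the kernelized fuzzy rough dependency, the target is a flat label vector (so the hierarchy and the sibling strategy are ignored), and the names `abs_corr`, `stream_select`, `relevance_min`, and `redundancy_max`, together with the toy data, are hypothetical.

```python
from statistics import mean, pstdev


def abs_corr(x, y):
    """Absolute Pearson correlation (a stand-in score, not the paper's measure)."""
    sx, sy = pstdev(x), pstdev(y)
    if sx == 0 or sy == 0:
        return 0.0  # a constant column carries no information
    mx, my = mean(x), mean(y)
    cov = mean([(a - mx) * (b - my) for a, b in zip(x, y)])
    return abs(cov / (sx * sy))


def stream_select(feature_stream, target, relevance_min=0.3, redundancy_max=0.9):
    """Process features one at a time; thresholds are assumed values."""
    selected = {}  # feature name -> column of values
    for name, column in feature_stream:
        relevance = abs_corr(column, target)
        if relevance < relevance_min:
            continue  # online relevance test: discard weak features
        # Already-selected features that are nearly duplicates of the new one.
        twins = [other for other, col in selected.items()
                 if abs_corr(column, col) >= redundancy_max]
        # Drop the new feature if a near-duplicate is at least as relevant.
        if any(abs_corr(selected[other], target) >= relevance for other in twins):
            continue
        # Otherwise the new feature replaces the near-duplicates it dominates.
        for other in twins:
            del selected[other]
        selected[name] = column
    return list(selected)


# Toy data (hypothetical): six samples, four features arriving in order.
target = [0, 0, 1, 1, 0, 1]
stream = [
    ("f1", [0.1, 0.2, 0.9, 0.8, 0.3, 0.7]),  # relevant
    ("f2", [0.5, 0.1, 0.4, 0.2, 0.3, 0.3]),  # uncorrelated with the target
    ("f3", [0.0, 0.0, 1.0, 1.0, 0.0, 1.0]),  # more relevant, near-duplicate of f1
    ("f4", [0.0, 1.0, 1.0, 1.0, 0.0, 1.0]),  # relevant and not redundant
]
kept = stream_select(stream, target)
print(kept)
```

On this toy stream, `f2` fails the relevance test, `f3` displaces the weaker but highly correlated `f1`, and `f4` is kept because it is relevant without duplicating `f3`.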
Yaojin Lin
[email protected]
1 School of Computer Science, Minnan Normal University, Zhangzhou, 363000, People’s Republic of China
2 Laboratory of Data Science, Intelligence Application, Minnan Normal University, Zhangzhou, 363000, People’s Republic of China
3 School of Mathematics and Statistics, Minnan Normal University, Zhangzhou, 363000, People’s Republic of China

1 Introduction
Hierarchies (taxonomies) are popular for organizing large-volume data sets in various application domains [9, 15]. For example, ImageNet is an image database organized according to the WordNet hierarchy (currently only the nouns), in which hundreds and thousands of images depict each node of the hierarchy. Hierarchies have also been used in many other areas, including biological data [9], Wikipedia [24], geographical data [39], and text data [3, 6, 44]. Therefore, large-scale hierarchical classification learning is an important and popular learning paradigm in the machine learning and data mining communities [9, 15].
From the viewpoint of biologists, the discovery of a new species is attributed to newly detected features. Furthermore, these new features are now also available in existing species [50]. Therefore, the challenge of hierarchical classification learning is that the full feature space is unknown before learning begins. As we know, the full feature space determines the final label category of the samples. For example, in the diagnosis of lung cancer, doctors gradually obtain the clinical signs of lung cancer patients over a period of clinical testing. Further, these patients may need to be diagnosed with small cell lung cancer, which is a subcategory of lung cancer. This phenomenon suggests that it is infeasible to collect all