

Feature selection with multi-objective genetic algorithm based on a hybrid filter and the symmetrical complementary coefficient

Rui Zhang1 · Zuoquan Zhang1 · Di Wang1 · Marui Du1

Accepted: 16 October 2020 © Springer Science+Business Media, LLC, part of Springer Nature 2020

Abstract

With the expansion of data size and data dimension, feature selection attracts more and more attention. In this paper, we propose a novel feature selection algorithm, namely, Hybrid filter and Symmetrical Complementary Coefficient based Multi-Objective Genetic Algorithm feature selection (HSMOGA). HSMOGA combines a new hybrid filter, the Symmetrical Complementary Coefficient (a recently proposed, well-performing metric of feature interactions), and a novel way to limit the feature subset's size. A new Pareto-based ranking function is proposed for solving the multi-objective problem. Besides, HSMOGA starts with a novel step called knowledge reserve, which precalculates the knowledge required for fitness-function calculation and initial-population generation. In this way, HSMOGA is classifier-independent in each generation, and its initial-population generation makes full use of the knowledge of the data set, which makes solutions converge faster. Compared with other GA-based feature selection methods, HSMOGA has a much lower time complexity. According to the experimental results, HSMOGA outperforms nine other state-of-the-art feature selection algorithms, including five classic and four more recent algorithms, in terms of kappa coefficient, accuracy, and G-mean for the data sets tested.

Keywords Feature selection · Feature interaction · Hybrid filter · Symmetrical complementary coefficient · Multi-objective genetic algorithm
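The abstract refers to a Pareto-based ranking function for the multi-objective problem. As a generic illustration only (not HSMOGA's actual ranking function, whose details appear later in the paper), the first Pareto front over two objectives — e.g., classification accuracy to maximize and feature-subset size to minimize, here encoded as a negated size — can be computed as follows; the score values are made up for the example:

```python
from typing import List, Tuple

def dominates(a: Tuple[float, float], b: Tuple[float, float]) -> bool:
    """True if solution a Pareto-dominates b (both objectives maximized)."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_front(scores: List[Tuple[float, float]]) -> List[int]:
    """Indices of non-dominated solutions (the first Pareto front)."""
    return [i for i, a in enumerate(scores)
            if not any(dominates(b, a) for j, b in enumerate(scores) if j != i)]

# Hypothetical candidate subsets: (accuracy, -subset size)
scores = [(0.90, -10), (0.85, -5), (0.92, -12), (0.80, -4), (0.90, -12)]
print(pareto_front(scores))  # → [0, 1, 2, 3]
```

The last candidate (0.90, −12) is excluded because (0.90, −10) achieves the same accuracy with fewer features; Pareto-based GA selection ranks individuals by which front they fall on rather than by a single weighted score.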

This work is supported by the National Natural Science Foundation of China under Grant 51727813.

Zuoquan Zhang [email protected] · Rui Zhang [email protected] · Di Wang [email protected] · Marui Du [email protected]

1 School of Science, Beijing Jiaotong University, Beijing, China

1 Introduction

Feature selection plays an important role in many aspects of machine learning, such as multivariate classification (including binary classification), where each instance has just one class, and multi-label classification [18, 19], where there is more than one class variable or each instance can belong to multiple classes at the same time, and

sometimes there are dependencies between these classes. These classification tasks all need to learn from the input data, and the features characterize the data from different perspectives. However, for a given data set, many features are sometimes not helpful for learning and mining tasks, or are even harmful [20]. Thus, selecting the correct features is essential. Nowadays, the demand for feature selection keeps growing, as data sets are getting bigger and wider.

1.1 Literature review

Features that need to be handled can be divided into three types, i.e., irrelevant features, redundant features, and interactive features. An irrelevant feature is one that does not help with learning an