A new hybrid stability measure for feature selection
Akshata K. Naik¹ · Venkatanareshbabu Kuppili¹ · Damodar Reddy Edla¹
© Springer Science+Business Media, LLC, part of Springer Nature 2020
Abstract Feature Selection (FS) algorithms are applied in bioinformatics to identify disease-causing genes. The performance of such algorithms is measured in terms of the accuracy of the model and the stability of the FS algorithm. Stability evaluates how closely the selected feature sets replicate one another across executions. Recent research has shown that a stability measure must satisfy a set of properties: being fully defined, monotonicity, boundedness, deterministic maximum stability, and correction for chance. Among the existing stability measures, only Nogueira’s frequency-based stability measure satisfies all the required properties. However, frequency-based stability measures fail to discriminate between cases in which the overall frequencies of the features are the same. To address this issue, the paper proposes a hybrid similarity-based stability measure that satisfies all of the desirable properties listed above. It is the first similarity-based stability measure to do so, and each property is established mathematically. The paper further proposes a combination of the frequency-based and similarity-based measures that preserves the aspects of both approaches. The work also analyzes the stability of LASSO and Elastic Net on synthetic and microarray gene expression datasets. Elastic Net shows higher stability and selects more of the relevant features.
Keywords Feature evaluation and selection · Gene selection · Stability measure · Similarity-based stability · Frequency-based stability
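The limitation of frequency-based measures noted in the abstract can be illustrated with a small sketch. It is not the measure proposed in the paper: it assumes the commonly cited form of Nogueira's frequency-based stability and uses the mean pairwise Jaccard index as a stand-in similarity-based measure. The function names and the two toy selection matrices are hypothetical, chosen only so that both matrices share the same per-feature selection frequencies.

```python
from itertools import combinations


def nogueira_stability(Z):
    """Frequency-based stability of a binary selection matrix Z
    (M runs x d features), assuming the commonly cited Nogueira form."""
    M, d = len(Z), len(Z[0])
    # Selection frequency of each feature across the M runs.
    p = [sum(row[f] for row in Z) / M for f in range(d)]
    # Average number of features selected per run.
    k_bar = sum(sum(row) for row in Z) / M
    # Unbiased sample variance of each feature's selection indicator.
    var = [M / (M - 1) * pf * (1 - pf) for pf in p]
    return 1 - (sum(var) / d) / ((k_bar / d) * (1 - k_bar / d))


def mean_pairwise_jaccard(Z):
    """Stand-in similarity-based stability: mean Jaccard index over all
    pairs of runs (assumes no pair of runs is jointly empty)."""
    sets = [{f for f, z in enumerate(row) if z} for row in Z]
    sims = [len(a & b) / len(a | b) for a, b in combinations(sets, 2)]
    return sum(sims) / len(sims)


# Two hypothetical outcomes of 3 runs over 4 features.
# Both have column sums (2, 2, 1, 1), i.e. identical feature frequencies.
case_a = [[1, 1, 0, 0],
          [1, 1, 0, 0],
          [0, 0, 1, 1]]
case_b = [[1, 1, 1, 1],
          [1, 0, 0, 0],
          [0, 1, 0, 0]]

for name, Z in (("A", case_a), ("B", case_b)):
    print(name, nogueira_stability(Z), mean_pairwise_jaccard(Z))
```

Because the frequency-based value depends only on the per-feature frequencies and the average subset size, it is identical for the two cases, while the pairwise similarity differs. This is the kind of situation a similarity-based or hybrid measure is meant to distinguish.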
1 Introduction High-dimensional data usually leads to overfitting and high computational complexity in machine learning tasks. One example of a high-dimensional dataset in bioinformatics is the microarray gene expression dataset. Dimensionality reduction techniques have gained considerable momentum as a way to deal with such datasets. These methods can be broadly classified into two categories. The first is Feature Selection (FS), in which a subset of relevant and nonredundant features is chosen from the original, larger set of features. The second is feature extraction, in which the high-dimensional data is mapped to a lower dimension without preserving the original feature set.
Akshata K. Naik, [email protected]
¹ National Institute of Technology Goa, Farmagudi, Ponda, Goa, India
Recently, in bioinformatics applications, FS techniques have been applied to select a set of disease-causing genes, a process termed gene selection [17]. The performance of gene selection algorithms is generally measured in terms of the accuracy of the learning model. It is worth pointing out that two different subsets of genes may give the same accuracy due to the presence of highly corre