Statistical model for reproducibility in ranking-based feature selection



Ari Urkullu¹ · Aritz Pérez² · Borja Calvo¹

Received: 19 October 2018 / Revised: 3 October 2020 / Accepted: 4 October 2020
© Springer-Verlag London Ltd., part of Springer Nature 2020

Abstract

The stability of feature subset selection algorithms has become crucial in real-world problems due to the need for consistent experimental results across different replicates. Specifically, in this paper, we analyze the reproducibility of ranking-based feature subset selection algorithms. When applied to data, this family of algorithms builds an ordering of variables in terms of a measure of relevance. In order to quantify the reproducibility of ranking-based feature subset selection algorithms, we propose a model that takes into account all the different-sized subsets of top-ranked features. The model is fitted to data through the minimization of an error function related to the expected values of Kuncheva's consistency index for those subsets. Once it is fitted, the model provides practical information about the feature subset selection algorithm analyzed, such as a measure of its expected reproducibility or its estimated area under the receiver operating characteristic curve regarding the identification of relevant features. We test our model empirically using both synthetic data and a wide range of real data. The results show that our proposal can be used to analyze ranking-based feature subset selection algorithms in terms of their reproducibility and their performance.

Keywords: Feature selection · Stability · Reproducibility · High dimensionality
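For reference, Kuncheva's consistency index, which underlies the error function mentioned in the abstract, compares two equally sized subsets of selected features. The following minimal Python sketch is ours, not the authors' code; the function name and example values are illustrative and only show how the index is computed for two subsets of top-ranked features.

```python
# Illustrative sketch (not from the paper): Kuncheva's consistency index
# for two equally sized feature subsets A and B drawn from n features.
def kuncheva_consistency(a, b, n):
    """Return Kuncheva's consistency index I_C(A, B).

    a, b : sets of selected feature indices, assumed to have equal size k
    n    : total number of features, with 0 < k < n
    """
    k = len(a)
    assert len(b) == k and 0 < k < n, "requires |A| = |B| = k and 0 < k < n"
    r = len(a & b)  # size of the intersection of the two subsets
    return (r * n - k ** 2) / (k * (n - k))

# Example: top-3 features selected in two replicates out of n = 10 features.
print(kuncheva_consistency({0, 1, 2}, {1, 2, 7}, 10))  # ~0.524
```

The index ranges from -1 to 1 and equals 0 in expectation for random selections of size k, which is what makes it a natural building block for a reproducibility model over all top-k subsets.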

Electronic supplementary material  The online version of this article (https://doi.org/10.1007/s10115-020-01519-3) contains supplementary material, which is available to authorized users.

Corresponding author: Ari Urkullu, [email protected]

¹ Department of Computer Science and Artificial Intelligence, University of the Basque Country (UPV/EHU), Paseo Manuel de Lardizabal, 1, 20018 Donostia, Gipuzkoa, Spain

² Department of Data Science, Basque Center for Applied Mathematics (BCAM), Alameda Mazarredo, 14, 48009 Bilbao, Bizkaia, Spain

1 Introduction

Due to the large quantity of irreproducible results, concern has arisen to such an extent that a perception of a reproducibility crisis has spread through the scientific community [4]. Among other factors, researchers point to insufficient replication in the original laboratory, poor oversight, and low statistical power or poor analysis as reasons for this crisis. Moreover, researchers identify a better understanding of statistics, better mentoring and more robust designs as some of the possible solutions to boost reproducibility. Indeed, the American Statistical Association (ASA) recently warned about the problems derived from the inappropriate use of some statistical tools [41]. In this work, we tackle the feature selection problem, a problem in which the previously mentioned concerns regarding reproducibility are also present. Specifically, in