Multi-label feature ranking with ensemble methods


Matej Petković1,2 · Sašo Džeroski1,2 · Dragi Kocev1,2

Received: 13 July 2019 / Revised: 10 June 2020 / Accepted: 24 August 2020
© The Author(s), under exclusive licence to Springer Science+Business Media LLC, part of Springer Nature 2020

Abstract

In this paper, we propose three ensemble-based feature ranking scores for multi-label classification (MLC), a generalisation of multi-class classification in which the classes are not mutually exclusive. Each of the scores (Symbolic, Genie3 and Random forest) can be computed from three different ensembles of predictive clustering trees: Bagging, Random forest and Extra trees. We extensively evaluate the proposed scores on 24 benchmark MLC problems, using 15 standard MLC evaluation measures. We determine the ranking quality saturation points in terms of ensemble size for each ranking-ensemble pair, and show that high-quality rankings can be computed efficiently (typically 10 or 50 trees suffice). We also show that the proposed feature rankings are relevant and determine the most appropriate ensemble method for every feature ranking score. We show empirically that the proposed feature ranking scores outperform current state-of-the-art methods in the quality of the rankings (for the majority of the evaluation measures) and in time efficiency. Finally, we determine the best performing feature ranking scores. Taking into account the quality of the rankings first and, in the case of ties, time efficiency, we identify the Genie3 feature ranking score as the optimal one.

Keywords: Feature ranking · Multi-label classification · Ensemble-based methods · Predictive clustering trees

Editors: Larisa Soldatova, Joaquin Vanschoren.

* Matej Petković
* Dragi Kocev
Sašo Džeroski

1 Jožef Stefan Institute, Jamova 39, 1000 Ljubljana, Slovenia
2 Jožef Stefan International Postgraduate School, Jamova 39, 1000 Ljubljana, Slovenia






Machine Learning

1 Introduction

In standard classification, the goal is to learn a model that predicts exactly one of two or more mutually exclusive, predefined class values, e.g., whether a given board position leads to a win, draw or loss if both players play optimally. Multi-label classification (MLC), in contrast, is a predictive modeling task in which an example can be labeled with more than one (or even zero) of the labels from a predefined label set L. We denote examples as (x, y), where (1) x is a vector of values of features xi that are either numeric (the domain of xi is a subset of ℝ) or nominal (the domain of xi is a finite set of values), and (2) y is a subset of the label set L. The elements of y are the labels that are relevant for a given example.

MLC problems are receiving more and more attention from the research community. For example, one use case of MLC is labeling pictures with the objects that appear in them. Due to the abundance of data m