Active learning for hierarchical multi-label classification


Felipe Kenji Nakano1,2 · Ricardo Cerri3 · Celine Vens1,2

Received: 13 September 2019 / Accepted: 4 July 2020
© The Author(s), under exclusive licence to Springer Science+Business Media LLC, part of Springer Nature 2020

Abstract Due to technological advances, a massive amount of data is produced daily, presenting challenges for application areas where data must be labelled by a domain specialist or by expensive procedures before it can be used for supervised machine learning. Active learning methods can be used to select which data points will provide the most information when labelled. Active learning (AL) is a subfield of machine learning that addresses methods to build models with fewer, but more representative, instances. Even though AL has been extensively studied, it has not been thoroughly investigated in hierarchical multi-label classification, a learning task where multiple class labels can be assigned to an instance and these labels are hierarchically structured. In this work, we provide a public framework containing baseline and state-of-the-art algorithms suitable for this task. We also propose a new algorithm, Hierarchical Query-By-Committee (H-QBC), which is validated on datasets from different domains. Our results show that H-QBC provides superior predictive performance compared to its competitors, while being computationally efficient and parameter-free.

Responsible editor: Ira Assent, Carlotta Domeniconi, Aristides Gionis, Eyke Hüllermeier

Electronic supplementary material The online version of this article (https://doi.org/10.1007/s10618-020-00704-w) contains supplementary material, which is available to authorized users.

Felipe Kenji Nakano [email protected] Ricardo Cerri [email protected] Celine Vens [email protected]

1 Department of Public Health and Primary Care, KU Leuven Campus KULAK, Etienne Sabbelaan 53, 8500 Kortrijk, Belgium

2 Itec, imec Research Group at KU Leuven, Etienne Sabbelaan 53, 8500 Kortrijk, Belgium

3 Department of Computer Science, Federal University of São Carlos, Rodovia Washington Luís, Km 235, São Carlos, SP 13565-905, Brazil

Keywords Active learning · Hierarchical multi-label classification · Predictive clustering trees

1 Introduction

Due to recent advances in technology, an exponentially growing amount of data is produced daily, with an impact on many scientific areas. From the machine learning perspective, this availability of data is promising, since it is well known that models are likely to perform better when learned from more data. However, in scenarios where data must be labelled by a domain specialist or by expensive procedures, it presents challenges, since labelling requires a substantial commitment of money and time. In such scenarios, labelled data is frequently scarce, whereas unlabelled data is abundant. As a countermeasure, active learning (AL) provides algor