Boosting methods for multi-class imbalanced data classification: an experimental review



Open Access

SURVEY PAPER

Boosting methods for multi-class imbalanced data classification: an experimental review

Jafar Tanha*, Yousef Abdi, Negin Samadi, Nazila Razzaghi and Mohammad Asadpour

*Correspondence: [email protected] Faculty of Electrical and Computer Engineering, University of Tabriz, P.O. Box 51666-16471, Tabriz, Iran

Abstract  Since canonical machine learning algorithms assume that each class in a dataset contains a comparable number of samples, even binary classification becomes a challenging task on imbalanced datasets, where minority-class samples are difficult to discriminate efficiently. For this reason, researchers have paid considerable attention to this problem and have proposed many methods to deal with it, which can be broadly categorized into data-level and algorithm-level approaches. Moreover, multi-class imbalanced learning is much harder than the binary case and is still an open problem. Boosting algorithms are a class of ensemble learning methods in machine learning that improve the performance of individual base learners by combining them into a composite whole. This paper aims to review the most significant published boosting techniques for multi-class imbalanced datasets. A thorough empirical comparison is conducted to analyze the performance of binary and multi-class boosting algorithms on various multi-class imbalanced datasets. In addition, based on the obtained results for the performance evaluation metrics and a recently proposed criterion for comparing metrics, the selected metrics are compared to determine a suitable performance metric for multi-class imbalanced datasets. The experimental studies show that the CatBoost and LogitBoost algorithms are superior to the other boosting algorithms on multi-class imbalanced conventional and big datasets, respectively. Furthermore, MMCC is a better evaluation metric than MAUC and G-mean in multi-class imbalanced data domains. Keywords:  Boosting algorithms, Imbalanced data, Multi-class classification, Ensemble learning
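As an aside, the G-mean mentioned in the abstract is commonly defined for the multi-class case as the geometric mean of the per-class recalls, so a single poorly recognized minority class drags the score toward zero. A minimal sketch of that standard definition (the labels and predictions below are hypothetical, not from the paper's experiments):

```python
# Illustrative sketch of the multi-class G-mean: the geometric mean of
# the recall obtained on each class.
from math import prod

def gmean(y_true, y_pred):
    classes = sorted(set(y_true))
    recalls = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        support = sum(t == c for t in y_true)
        recalls.append(tp / support)
    # Geometric mean of per-class recalls.
    return prod(recalls) ** (1 / len(recalls))

# Hypothetical three-class example: class 1 has one misclassified sample.
y_true = [0] * 6 + [1] * 3 + [2] * 1
y_pred = [0] * 6 + [1] * 2 + [0] + [2]
print(round(gmean(y_true, y_pred), 3))
```

Because the per-class recalls are multiplied, a recall of zero on any class makes the whole G-mean zero, which is exactly why it is considered for imbalanced domains.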

Introduction Imbalanced data set classification is a relatively new research line within the broader context of machine learning, which tries to learn from skewed data distributions. A data set is imbalanced when one class contains more instances than the rest of the classes, in both two-class and multi-class settings [1]. Most standard machine learning algorithms perform poorly on this kind of dataset because they tend to favor the majority-class samples, resulting in poor predictive accuracy over the minority class [2]. Therefore, it becomes tough to learn the rare but
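The majority-class bias described above can be made concrete with a toy example (hypothetical data, not from the paper): a degenerate classifier that always predicts the majority class attains high overall accuracy while achieving zero recall on the minority class, which is why plain accuracy is misleading on imbalanced data.

```python
# Toy illustration of the accuracy trap on imbalanced data.
from collections import Counter

# Hypothetical labels: 95 majority-class samples (0), 5 minority (1).
y_true = [0] * 95 + [1] * 5

# Degenerate baseline: always predict the most frequent class.
majority = Counter(y_true).most_common(1)[0][0]
y_pred = [majority] * len(y_true)

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
minority_recall = (
    sum(t == p == 1 for t, p in zip(y_true, y_pred))
    / sum(t == 1 for t in y_true)
)

print(accuracy)         # 0.95
print(minority_recall)  # 0.0
```

Despite 95% accuracy, every minority sample is misclassified, motivating the data-level and algorithm-level remedies this survey reviews.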

© The Author(s) 2020. This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.