Enhancing techniques for learning decision trees from imbalanced data



Ikram Chaabane¹ · Radhouane Guermazi² · Mohamed Hammami³

Received: 20 January 2017 / Revised: 29 January 2019 / Accepted: 18 February 2019
© Springer-Verlag GmbH Germany, part of Springer Nature 2019

Abstract
Several machine learning techniques assume that the classes under consideration contain approximately the same number of objects. In real-world applications, however, the class of interest is generally scarce. Under such imbalance, most standard learning algorithms can reach high global accuracy, yet accuracy on the minority class remains a real challenge. To deal with this issue, we introduce in this paper a novel adaptation of the decision tree algorithm to imbalanced data. We propose a new asymmetric entropy measure that adjusts the most uncertain class distribution to the a priori class distribution and involves it in the node-splitting process. Unlike most competing split criteria, which include only the maximum uncertainty vector in their formula, the proposed entropy is customizable, with an adjustable concavity to better comply with the system's expectations. Experimental results across thirty-five differently class-imbalanced data sets show significant improvements over various split criteria adapted for imbalanced situations. Furthermore, when combined with sampling strategies and ensemble-based methods, our entropy yields significant enhancements in minority class prediction, along with good handling of the data difficulties related to the class imbalance problem.

Keywords Asymmetric decision trees · Imbalanced data · Entropy measures · Classification problem · Index of balanced accuracy
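The gap between global accuracy and minority class accuracy mentioned in the abstract can be illustrated with a small calculation. The sketch below is not taken from the paper: the class counts (950 negative, 50 positive) and the always-predict-majority classifier are assumptions chosen purely for illustration.

```python
# Hypothetical class counts, invented for illustration (not from the paper).
n_negative = 950  # majority class, e.g. solvent customers
n_positive = 50   # minority class, e.g. insolvent customers

# True labels: 0 = negative (majority), 1 = positive (minority).
y_true = [0] * n_negative + [1] * n_positive

# A trivial classifier that always predicts the majority class.
y_pred = [0] * len(y_true)


def accuracy(y_true, y_pred):
    """Fraction of all examples that are classified correctly."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def recall(y_true, y_pred, label):
    """Fraction of the examples of class `label` that are recovered."""
    hits = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    total = sum(1 for t in y_true if t == label)
    return hits / total


global_accuracy = accuracy(y_true, y_pred)
minority_recall = recall(y_true, y_pred, label=1)
majority_recall = recall(y_true, y_pred, label=0)

print(f"global accuracy: {global_accuracy:.2f}")
print(f"majority recall: {majority_recall:.2f}")
print(f"minority recall: {minority_recall:.2f}")
```

Under these assumed counts the classifier is right on 95% of the examples while recovering none of the positive ones, which is why global accuracy alone is a poor guide when the class of interest is scarce.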

1 Ikram Chaabane, MIRACL-FSEG, Sfax University, Sfax, Tunisia
2 Radhouane Guermazi, Saudi Electronic University, Riyadh, Kingdom of Saudi Arabia
3 Mohamed Hammami, MIRACL-FSS, Sfax University, Sfax, Tunisia


1 Introduction
Recently, the class imbalance problem has drawn significant attention from researchers in the field of data mining. It occurs when a minority class (called positive) is weakly represented but needs to be accurately predicted in a classification task. Such situations are frequently encountered in real-world applications. In the banking field, for example, there may be more solvent customers (negative examples) than insolvent ones (positive examples). Imbalance affects not only the class distribution but also the cost associated with each class. In our example, a wrong decision for an insolvent customer might cause losses for the bank if the customer took out a large loan, whereas misclassifying a solvent customer would, at worst, mean forgoing a gain. Such situations also arise in social sciences (Bressoux 2010), credit card fraud detection (Shen et al. 2007), medical diagnostic imaging (Bosch et al. 2007), bio-inform