Is There a Relationship Between Neighborhoods of Minority Class Instances and the Performance of Classification Methods?



versidad Autónoma del Estado de México, CU UAEM Zumpango, Camino, Viejo a Jilotzingo s/n Col. Valle Hermoso, 5600 Zumpango, Estado de México, Mexico [email protected]
2 Universidad Autónoma del Estado de México, CU UAEM Texcoco, Av. Jardín Zumpango s/n, Fracc. El Tejocote, 56259 Texcoco, Mexico

Abstract. The performance of classification methods is notably damaged by imbalanced data sets. Although some studies analyzing this behavior have been carried out before, most of the conclusions obtained from experiments correspond to synthetic data sets. In this paper, we study the relationship between the performance of five classification methods and the neighbors of minority class instances. According to the results of our experiments, we found strong empirical evidence that the type of neighborhood of minority class instances affects classification accuracy. Indeed, we observe that the type of neighborhood is more important than the imbalance rate. To validate the results, we use ten real-world imbalanced data sets and measure AUC ROC and True Positive Rates.

Keywords: Imbalanced classification · SMOTE · Nearest neighbors · Minority class
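To make the two quantities named in the abstract concrete, the following is a minimal sketch rather than the authors' procedure. It assumes that a neighborhood can be summarized by counting how many of the k nearest neighbors of each minority instance also belong to the minority class; the paper's own typing of neighborhoods may differ. The function names, the toy data and the choice of k = 2 are all hypothetical.

```python
import math


def minority_neighbor_counts(X, y, minority_label, k=5):
    """For each minority instance, count minority-class points among its
    k nearest neighbors (brute-force Euclidean search; an assumed summary
    of a 'neighborhood', not necessarily the paper's definition)."""
    counts = []
    for i, (xi, yi) in enumerate(zip(X, y)):
        if yi != minority_label:
            continue
        # Distance from instance i to every other instance, nearest first.
        dists = sorted((math.dist(xi, xj), j) for j, xj in enumerate(X) if j != i)
        neighbors = [j for _, j in dists[:k]]
        counts.append(sum(1 for j in neighbors if y[j] == minority_label))
    return counts


def true_positive_rate(y_true, y_pred, positive_label):
    """TPR = TP / (TP + FN), computed on the positive (minority) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred)
             if t == positive_label and p == positive_label)
    fn = sum(1 for t, p in zip(y_true, y_pred)
             if t == positive_label and p != positive_label)
    return tp / (tp + fn) if (tp + fn) else 0.0


# Hypothetical toy data: six majority points (label 0) near the origin,
# a small minority cluster (label 1) far away, and one minority point
# placed inside the majority region.
X = [(0, 0), (0, 1), (1, 0), (1, 1), (2, 2), (2, 0),
     (5, 5), (5, 6), (6, 5), (0.5, 0.5)]
y = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]

counts = minority_neighbor_counts(X, y, minority_label=1, k=2)

# Hypothetical predictions from some classifier on six test instances.
tpr = true_positive_rate([1, 1, 1, 1, 0, 0], [1, 1, 1, 0, 0, 1], positive_label=1)
```

In this toy example the three clustered minority points have only minority neighbors, while the point inside the majority region has none, which is the kind of contrast between neighborhoods that the abstract relates to classifier performance.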



© Springer International Publishing Switzerland 2016 D.-S. Huang et al. (Eds.): ICIC 2016, Part I, LNCS 9771, pp. 750–761, 2016. DOI: 10.1007/978-3-319-42291-6_75

1 Introduction

Imbalanced data sets contain a large number of instances of one category, identified as the majority class, and just a few instances of the opposite type, the minority class. Currently, many applications generate this type of data, for example, medical diagnosis [1–3], fraud detection in telecommunications [4] and agriculture [5]. The underlying concept hidden in the minority class instances is by far the most important [6], but also the hardest for classification methods to capture.

Classic classification methods were designed under the hypothesis that data sets are balanced. Therefore, in most cases these methods ignore the minority class and focus only on correctly predicting the instances of the majority class [6, 7].

In the literature, most research on the imbalance problem has focused on three main topics [8]. The first is the proposal of new methods to face the imbalance problem [9–13]. The second is measuring the performance of classification methods with imbalanced data sets [14–17]; most of these sets are synthetic or from specific domains. The third topic groups the works that study the complexity of data sets [18, 19]. These studies are very valuable for understanding why classification methods fail with imbalanced data sets. Japkowicz and Stephen [18] carried out a set of experiments to establish possible relationships between concept complexity, size of the training set and class imbalance level. Conclusions from such experiments suggest that data complexity hinders the performance of classifiers. Prati et al. [20] and Batista et al. [15] show that class overlapping has a more negative impact than imbalan