Is There a Relationship Between Neighborhoods of Minority Class Instances and the Performance of Classification Methods?



1 Universidad Autónoma del Estado de México, CU UAEM Zumpango, Camino Viejo a Jilotzingo s/n, Col. Valle Hermoso, 5600 Zumpango, Estado de México, Mexico
2 Universidad Autónoma del Estado de México, CU UAEM Texcoco, Av. Jardín Zumpango s/n, Fracc. El Tejocote, 56259 Texcoco, Mexico

Abstract. The performance of classification methods degrades notably on imbalanced data sets. Although some studies have analyzed this behavior before, most of their conclusions come from experiments on synthetic data sets. In this paper, we study the relationship between the performance of five classification methods and the neighbors of minority class instances. According to the results of our experiments, we found strong empirical evidence that the type of neighborhood of minority class instances affects classification accuracy. Indeed, we observe that the type of neighborhood is more important than the imbalance rate. In order to validate the results, we use ten real-world imbalanced data sets and measure AUC ROC and True Positive Rates.

Keywords: Imbalanced classification · SMOTE · Nearest neighbors · Minority class

© Springer International Publishing Switzerland 2016 D.-S. Huang et al. (Eds.): ICIC 2016, Part I, LNCS 9771, pp. 750–761, 2016. DOI: 10.1007/978-3-319-42291-6_75

1 Introduction

Imbalanced data sets contain a large number of instances of one category, identified as the majority class, and just a few instances of the opposite type, the minority class. Currently, many applications generate this type of data, for example, medical diagnosis [1–3], fraud detection in telecommunications [4] and agriculture [5]. The underlying concept hidden in the minority class instances is by far the most important [6], but also the hardest for classification methods to capture. Classic classification methods were designed under the hypothesis that data sets are balanced. Therefore, in most cases these methods ignore the minority class and focus only on correctly predicting the instances of the majority class [6, 7].

In the literature, most research on the imbalance problem has focused on three main topics [8]. The first is the proposal of new methods to face the imbalance problem [9–13]. The second is about measuring the performance of classification methods with imbalanced data sets [14–17]; most of these sets are synthetic or from specific domains. The third topic groups the works that study the complexity of data sets [18, 19]. These studies are very valuable for understanding why classification methods fail with imbalanced data sets. Japkowicz and Stephen [18] carried out a set of experiments to establish possible relationships between concept complexity, size of the training set and class imbalance level. The conclusions from these experiments suggest that data complexity hinders the performance of classifiers. Prati et al. [20] and Batista et al. [15] show that class overlapping has a more negative impact than imbalan
