Imbalanced Data Classification: A Novel Re-sampling Approach Combining Versatile Improved SMOTE and Rough Sets

In recent years, the problem of learning from imbalanced data has emerged as important and challenging. The fact that one of the classes is underrepresented in the data set is not the only reason for the difficulties. The complex distribution of the data, especially class overlapping, noise and small disjuncts, also makes the learning task harder.

Introduction

Proper classification of imbalanced data is one of the most challenging problems in data mining. Since a wide range of real-world domains suffers from this issue, it is crucial to develop ever more effective techniques to deal with it. The fundamental source of difficulty is the fact that one class (positive, minority) is underrepresented in the data set. Furthermore, the correct recognition of examples belonging to this particular class is usually a matter of major interest. In domains like medical diagnostics, anomaly detection, fault diagnosis, detection of oil spills, risk management and fraud detection [8,21], the misclassification cost of rare cases is obviously very high. A small subset of data describing disease cases is more meaningful than the remaining majority of objects representing the healthy population. Therefore, dedicated algorithms should be applied to recognize minority class instances in these areas.

Over the last years, researchers’ growing interest in imbalanced data has contributed to considerable advancements in this field. Numerous methods have been proposed to address the problem. They are grouped into three main categories [8,21]:

– data-level techniques: adding a preliminary data processing step, mainly undersampling and oversampling,
– algorithm-level approaches: modifications of existing algorithms,
– cost-sensitive methods: combining data-level and algorithm-level techniques to set different misclassification costs.

In this paper we focus on data-level approaches: generating new minority class samples (oversampling) and introducing an additional cleaning step (undersampling). Creating new examples of the minority class requires careful analysis of the data distribution. Random replication of positive instances may lead to overfitting [8]. Furthermore, even methods like the Synthetic Minority Oversampling Technique (SMOTE) [5], which creates new samples by interpolating minority class examples that lie close together, may not be sufficient for a variety of real-life domains. Indeed, the main reason for the difficulties in learning from imbalanced data is the complex distribution of the data: class overlapping, noise and small disjuncts [8,11,13,15]. The VIS algorithm [4], incorporated into the proposed approach, addresses these problems by applying a dedicated mechanism to each specific group of minority class examples; objects are assigned to groups based on their local characteristics (see the sketch below). Although this solution accounts for these additional difficulties, in the case of highly complex problems it may still contribute to the creation of noisy objects. Hence, a cleaning mechanism is introduced as the second step of preprocessing. On the other hand, the new preliminary step deals with uncertainty by relabeling ambiguous objects.
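To make these ideas concrete, the following minimal Python sketch illustrates the two ingredients described above: labeling minority examples according to their local neighbourhood and SMOTE-style interpolation between nearby minority examples. The function names (categorize_minority, smote_sketch), the neighbourhood thresholds and the use of plain Euclidean k-nearest neighbours are illustrative assumptions; they are not the exact definitions used by VIS or by the rough-set based cleaning step proposed in this paper.

import numpy as np

def categorize_minority(X, y, minority_label=1, k=5):
    # Label each minority example by its local characteristics: the fraction of
    # minority examples among its k nearest neighbours in the whole data set.
    # The thresholds below are illustrative, not the VIS category definitions.
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    nn = np.argsort(dists, axis=1)[:, 1:k + 1]   # k nearest neighbours, self excluded
    categories = {}
    for i in np.where(y == minority_label)[0]:
        frac = np.mean(y[nn[i]] == minority_label)
        if frac >= 0.7:
            categories[i] = "safe"
        elif frac >= 0.3:
            categories[i] = "borderline"
        else:
            categories[i] = "noise"
    return categories

def smote_sketch(minority, n_new, k=5, seed=None):
    # Generate n_new synthetic samples by interpolating a randomly chosen
    # minority example with one of its k nearest minority neighbours.
    rng = np.random.default_rng(seed)
    minority = np.asarray(minority, dtype=float)
    n, d = minority.shape
    dists = np.linalg.norm(minority[:, None, :] - minority[None, :, :], axis=-1)
    neighbours = np.argsort(dists, axis=1)[:, 1:k + 1]
    synthetic = np.empty((n_new, d))
    for i in range(n_new):
        a = rng.integers(n)               # a random minority example
        b = rng.choice(neighbours[a])     # one of its k nearest minority neighbours
        gap = rng.random()                # interpolation coefficient in [0, 1]
        synthetic[i] = minority[a] + gap * (minority[b] - minority[a])
    return synthetic

In such a pipeline, a dedicated oversampling strategy could be applied per category (for instance, interpolating borderline examples more intensively than safe ones), after which a cleaning step, as in the approach proposed here, would remove or relabel ambiguous objects among the original and synthetic samples.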