Robust hybrid data-level sampling approach to handle imbalanced data during classification


METHODOLOGIES AND APPLICATION

Prabhjot Kaur¹ · Anjana Gosain²

© Springer-Verlag GmbH Germany, part of Springer Nature 2020

Abstract Classification plays a significant role in finding patterns in data. Classifier performance is strongly affected by data impurities such as class imbalance, noise, class overlap and differing data distributions within classes, and data in real-world applications are often corrupted by several of these impurities at once. To address this issue, this paper proposes a hybrid data-level method that handles multiple data impurities: class imbalance, noise and differing data distributions within classes. The proposed approach works in phases. In the first phase, it identifies and removes noise from the data and then detects the minority and majority clusters using a kernel-based fuzzy clustering approach with a radial basis kernel. In the next phase, the minority and majority clusters are processed to balance the data: radial basis kernel fuzzy membership and an α-cut reduce the size of the majority cluster, and a firefly-based SMOTE method intelligently produces synthetic data within the minority cluster. After the data impurities have been removed, a traditional classifier (Decision Tree) is used to classify the balanced data. The proposed method is tested on 3 synthetic data-sets and 44 UCI real-world data-sets with imbalance ratios ranging from 1.82 to 129.44. The area under the ROC curve is used to assess the proposed method and to compare it with 20 other data-level methods. Experimental results confirm that the proposed method outperforms every other method, especially on highly imbalanced data-sets.

Keywords Class imbalance · Outlier identification · Kernel approach · Hybrid approach · Firefly concept · Imbalanced data-set
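The balancing phase described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' algorithm, and it rests on several assumptions. The majority-class mean stands in for a cluster centre found by kernel-based fuzzy clustering. The Gaussian form of the radial basis kernel and the values of `sigma` and `alpha` are illustrative. Plain random SMOTE-style interpolation replaces the firefly-based SMOTE step. The function names `rbf_membership`, `alpha_cut_undersample` and `smote_like_oversample` are hypothetical.

```python
import numpy as np


def rbf_membership(X, center, sigma):
    """Gaussian RBF similarity of each row of X to `center`, in (0, 1].

    Assumed kernel form: exp(-||x - v||^2 / (2 * sigma^2)).
    """
    sq_dist = np.sum((np.asarray(X, dtype=float) - center) ** 2, axis=1)
    return np.exp(-sq_dist / (2.0 * sigma ** 2))


def alpha_cut_undersample(X_maj, alpha, sigma=1.0):
    """Keep majority points whose membership is at least `alpha`.

    Assumption: the majority mean is used as the single cluster centre.
    """
    X_maj = np.asarray(X_maj, dtype=float)
    center = X_maj.mean(axis=0)
    mu = rbf_membership(X_maj, center, sigma)
    return X_maj[mu >= alpha]


def smote_like_oversample(X_min, n_new, k=3, seed=0):
    """Plain SMOTE-style interpolation (no firefly guidance).

    Each synthetic point lies on the segment between a minority point
    and one of its k nearest minority neighbours. Needs >= 2 points.
    """
    X_min = np.asarray(X_min, dtype=float)
    rng = np.random.default_rng(seed)
    # Pairwise squared distances; the diagonal is masked so that a
    # point is never chosen as its own neighbour.
    d = np.sum((X_min[:, None, :] - X_min[None, :, :]) ** 2, axis=2)
    np.fill_diagonal(d, np.inf)
    k = min(k, len(X_min) - 1)
    neighbours = np.argsort(d, axis=1)[:, :k]
    base = rng.integers(0, len(X_min), size=n_new)
    pick = neighbours[base, rng.integers(0, k, size=n_new)]
    lam = rng.random((n_new, 1))  # interpolation weight in [0, 1)
    return X_min[base] + lam * (X_min[pick] - X_min[base])


# Toy data: 200 majority points and 10 minority points in 2-D.
rng = np.random.default_rng(42)
X_maj = rng.normal(0.0, 1.0, size=(200, 2))
X_min = rng.normal(3.0, 0.5, size=(10, 2))

# Shrink the majority class, then grow the minority class to match it.
X_maj_kept = alpha_cut_undersample(X_maj, alpha=0.3)
n_new = max(0, len(X_maj_kept) - len(X_min))
X_new = smote_like_oversample(X_min, n_new)
X_min_balanced = np.vstack([X_min, X_new])
```

In this sketch a larger `alpha` discards more majority points far from the centre, so `alpha` trades off how much the majority class shrinks against how many synthetic minority points are needed to balance the two classes.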

Communicated by V. Loia.

Prabhjot Kaur [email protected]

¹ Maharaja Surajmal Institute of Technology, GGSIP University, New Delhi, India

² USICT, GGSIP University, New Delhi, India

1 Introduction

Classification is a popular technique for detecting patterns in a data-set. Many classification approaches have been developed and are used successfully in many domains to solve real-life problems (Zhao et al. 2020; Ramesh et al. 2015). Sometimes, real-time data contain impurities such as class imbalance, noise, overlapping classes and differing data distributions within classes. Data become imbalanced when the classes to be detected differ in size, i.e., the size of one class differs from that of another, as in credit card fraud, fraudulent telephone calls, shuttle system failure, oil spill detection and web spam detection (Mollineda et al. 2007; Yong 2012), where the interest is always in identifying the smaller (minority) class within the whole data-set. The extent of imbalance is measured with the class imbalance ratio, which is the fraction of size of