A design of information granule-based under-sampling method in imbalanced data classification


METHODOLOGIES AND APPLICATION

Tianyu Liu1 • Xiubin Zhu1,5 • Witold Pedrycz1,2,4 • Zhiwu Li1,3

Springer-Verlag GmbH Germany, part of Springer Nature 2020

Abstract In numerous real-world problems, we face difficulties in learning from imbalanced data. The classification performance of a "standard" classifier (learning algorithm) is evidently hindered by the imbalanced distribution of data. Over-sampling and under-sampling methods have been researched extensively with the aim of increasing the prediction accuracy on the minority class. However, traditional under-sampling methods tend to ignore important characteristics pertinent to the majority class. In this paper, a novel under-sampling method based on information granules is proposed. The method exploits the concepts and algorithms of granular computing. First, information granules are built around selected patterns coming from the majority class to capture the essence of the data belonging to this class. Subsequently, the resultant information granules are evaluated in terms of their quality, and those with the highest specificity values are selected. Next, the selected numeric data are augmented by weights implied by the size of the information granules. Finally, a support vector machine and a K-nearest-neighbor classifier, both regarded here as representative classifiers, are built based on the weighted data. Experimental studies are carried out using synthetic data as well as a suite of imbalanced data sets coming from public machine learning repositories. The experimental results quantify the performance of the support vector machine and K-nearest-neighbor classifiers with the under-sampling method based on information granules. The results demonstrate the superiority of this performance over that obtained for the same classifiers endowed with conventional under-sampling methods. In general, the improvement in performance expressed in terms of G-means is over 10% when applying information granule under-sampling compared with random under-sampling.

Keywords Imbalanced data • Information granule • Support vector machine (SVM) • K-nearest-neighbor (KNN) • Under-sampling
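The steps outlined in the abstract (build granules around majority-class patterns, rank them by specificity, attach size-based weights) can be illustrated with a minimal sketch. It is not the authors' algorithm: the ball-shaped granules, the rule that a radius extends to the nearest minority point, the linear specificity formula, and the coverage-based weights are all assumptions made here for illustration, as are the function names. Only the G-mean, the square root of sensitivity times specificity, follows its standard definition.

```python
import math

def dist(a, b):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def build_granules(majority, minority):
    """Build one ball-shaped granule per majority point (assumed construction)."""
    # Largest pairwise distance in the majority class, used to normalise radii.
    max_range = max(dist(p, q) for p in majority for q in majority) or 1.0
    granules = []
    for center in majority:
        # Assumption: the ball extends up to the nearest minority point.
        radius = min(dist(center, m) for m in minority)
        # Coverage: majority points lying strictly inside the ball.
        coverage = sum(1 for p in majority if dist(center, p) < radius)
        # Assumption: specificity decreases linearly with radius, clipped at 0.
        specificity = max(0.0, 1.0 - radius / max_range)
        granules.append({"center": center, "radius": radius,
                         "coverage": coverage, "specificity": specificity})
    return granules

def undersample(majority, minority, keep):
    """Keep the `keep` most specific granules; return (center, weight) pairs."""
    granules = build_granules(majority, minority)
    selected = sorted(granules, key=lambda g: g["specificity"], reverse=True)[:keep]
    total = sum(g["coverage"] for g in selected)
    if total == 0:
        # Degenerate case: fall back to uniform weights.
        return [(g["center"], 1.0 / len(selected)) for g in selected]
    # Assumption: weight is proportional to the number of covered majority points.
    return [(g["center"], g["coverage"] / total) for g in selected]

def g_mean(y_true, y_pred, positive=1):
    """Geometric mean of sensitivity (minority recall) and specificity."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p != positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return math.sqrt(sensitivity * specificity)

# Toy data: five majority points, two minority points.
majority = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5)]
minority = [(2, 2), (2, 3)]
weighted_sample = undersample(majority, minority, keep=2)
```

The resulting weighted points could then be passed, together with the minority class, to any classifier that accepts per-sample weights; how the paper feeds the weights into its SVM and KNN classifiers is not specified in this excerpt.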

1 Introduction With the rapid development of science and technology, the volume of raw data has grown rapidly, which gives rise to imbalanced data (where the instances of one class are far outnumbered by the instances coming from the other classes). Many learning techniques based on clustering or classification concepts have been developed (Abualigah and Hanandeh 2015; Abualigah and Khader 2017; Abualigah et al. 2017, 2018a, b, c; Abualigah 2018). However, traditional classification techniques usually assume a uniform

Communicated by V. Loia.

Corresponding author: Xiubin Zhu

Department of Electrical and Computer Engin
