SMOTE-WENN: Solving class imbalance and small sample problems by oversampling and distance scaling



Hongjiao Guan 1,2,3 · Yingtao Zhang 3 · Min Xian 4 · H. D. Cheng 5 · Xianglong Tang 3

© Springer Science+Business Media, LLC, part of Springer Nature 2020

Abstract Many practical applications suffer from imbalanced data classification, in which the minority class has a degraded recognition rate. The primary causes are the scarcity of minority-class samples and the intrinsically complex distribution characteristics of imbalanced datasets. The imbalanced classification problem is even more serious on small sample datasets. To address both the small sample and class imbalance problems, a hybrid resampling method is proposed. The proposed method combines an oversampling approach (the synthetic minority oversampling technique, SMOTE) with a novel data cleaning approach (the weighted edited nearest neighbor rule, WENN). First, SMOTE generates synthetic minority-class examples by linear interpolation. Then, WENN detects and deletes unsafe majority- and minority-class examples using a weighted distance function and the k-nearest neighbor (kNN) rule. The weighted distance function scales up a commonly used distance by accounting for local imbalance and spatial sparsity. Extensive experiments on synthetic and real datasets validate the superiority of the proposed SMOTE-WENN over three state-of-the-art resampling methods.

Keywords Imbalanced data classification · Small sample datasets · Oversampling · Data cleaning
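The two stages described above can be sketched in code. This is a minimal illustrative sketch, not the paper's implementation: the function names, the brute-force distance computation, and the use of the classical (unweighted) ENN rule are our own simplifications. The paper's WENN replaces the plain Euclidean distance used here with its weighted, imbalance- and sparsity-aware distance before applying the kNN rule.

```python
import numpy as np

def smote_sample(X_min, n_new, k=5, rng=None):
    """SMOTE-style oversampling: create n_new synthetic minority examples,
    each placed by linear interpolation between a randomly chosen minority
    seed point and one of its k nearest minority-class neighbors."""
    rng = np.random.default_rng(rng)
    n = len(X_min)
    # pairwise Euclidean distances within the minority class
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)                 # a point is not its own neighbor
    nn = np.argsort(d, axis=1)[:, :k]           # k nearest minority neighbors
    seeds = rng.integers(0, n, size=n_new)      # random seed points
    nbrs = nn[seeds, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))                # interpolation coefficients in [0, 1)
    return X_min[seeds] + gap * (X_min[nbrs] - X_min[seeds])

def enn_clean(X, y, k=3):
    """Classical edited nearest neighbor (ENN) cleaning: delete any example
    whose label disagrees with the majority of its k nearest neighbors."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    nn = np.argsort(d, axis=1)[:, :k]
    keep = np.array([np.sum(y[nn[i]] == y[i]) > k / 2 for i in range(len(X))])
    return X[keep], y[keep]
```

In a SMOTE-WENN-style pipeline, `smote_sample` would first be applied to the minority class alone, and the cleaning step would then be run on the combined (original plus synthetic) dataset to remove unsafe examples from both classes.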

✉ Hongjiao Guan
  [email protected]

1 School of Cyber Security, Qilu University of Technology (Shandong Academy of Sciences), Jinan 250353, China
2 Shandong Computer Science Center (National Supercomputer Center in Jinan), Shandong Provincial Key Laboratory of Computer Networks, Qilu University of Technology (Shandong Academy of Sciences), Jinan 250014, China
3 School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, China
4 Department of Computer Science, University of Idaho, Idaho Falls, USA
5 School of Computer Science, Utah State University, Logan, UT 84322, USA

1 Introduction

Imbalanced data classification is common in practical applications [1, 2], such as medical diagnosis and defect prediction. It has long been a challenge in data mining and machine learning. For two-class datasets, imbalance means that the number of examples in one class (called the positive or minority class) is far smaller than that of the other class (called the negative or majority class). Imbalanced datasets lead to

the performance deterioration of traditional classification methods. In particular, the recognition rate of the minority class decreases severely. However, the minority class is precisely the class of interest from an application point of view. Furthermore, the misclassification cost of the minority class is usually higher than that of the majority class. The imbalanced classification problem can be explained from two aspects. One is the inappropriate optimization metrics used in traditional learning algorithms. These algorithms as