An improved deep forest for alleviating the data imbalance problem
METHODOLOGIES AND APPLICATION
Jie Gao1 • Kunhong Liu1 • Beizhan Wang1 • Dong Wang2 • Qingqi Hong1
© Springer-Verlag GmbH Germany, part of Springer Nature 2020
Abstract Most deep learning methods have inherent limitations on small-sized imbalanced datasets and are therefore rarely applied to their classification. On the one hand, data imbalance biases the classification results of the model toward the majority class. On the other hand, limited training data leads to over-fitting. Deep forest (DF) is a deep learning model that works well on small-sized datasets, and its performance is highly competitive with that of deep neural networks. In the present study, a variant of the DF called the imbalanced deep forest (IMDF) is proposed to improve the classification performance on the minority class and to explore the application of deep learning to small-sized imbalanced datasets. The IMDF is a cascade of multiple layers, where each layer is an ensemble of multiple units. The main idea behind the proposed method is to enable each unit of the IMDF to handle imbalanced data, so that the classification results of the entire IMDF are biased toward the minority class. Experiments demonstrate the effectiveness of the proposed method.
Keywords Deep forest • AdaBoost • SMOTE • Imbalanced data
Communicated by V. Loia.
Kunhong Liu [email protected] • Beizhan Wang [email protected]
1 School of Informatics, Xiamen University, Xiamen 361005, People's Republic of China
2 State Grid Fujian Electric Power Company, Fuzhou 350003, People's Republic of China
1 Introduction Rare events are normally defined as unusual patterns and abnormal behaviors that occur at a low frequency of less than 5%, or even less than 0.1% (Maher Maalouf 2011). Moreover, rare events are characterized by under-represented instances and high misclassification costs. Studies show that detecting rare events is a great challenge in real applications, such as network intrusion detection and cancer detection. For example, intrusions account for only a very small fraction of total network traffic. Yet errors on the rare class are costly: misclassifying a cancerous patient as a noncancerous one delays timely treatment and may result in a heavy cost (Chawla et al. 2003). Therefore, the nature of these applications requires a high detection rate for rare events. In the field of data mining, identifying rare events is essentially a binary classification problem on imbalanced data (Branco et al. 2016). It is normally assumed that the class with the smallest size forms the minority class (or the positive class), while the other classes form the majority class (or the negative class). The imbalanced rate (IR) is an indicator normally used to evaluate the skew level of a dataset. It is defined as the ratio of the number of majority-class instances to the number of minority-class instances (Loyola-González et al. 2017). Most conventional ma
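The imbalanced rate (IR) defined in the Introduction can be illustrated with a minimal sketch. The function name `imbalanced_rate` and the toy label lists are illustrative assumptions and do not come from the paper. The sketch follows the convention stated in the text: the smallest class is the minority class, and all remaining classes together form the majority class.

```python
from collections import Counter


def imbalanced_rate(labels):
    """Return the IR: majority-class size divided by minority-class size.

    Hypothetical helper for illustration only. The smallest class is
    treated as the minority class; all other classes are pooled into
    the majority class, as described in the text.
    """
    counts = Counter(labels)
    if len(counts) < 2:
        raise ValueError("at least two classes are required")
    minority = min(counts.values())           # size of the smallest class
    majority = sum(counts.values()) - minority  # all remaining instances
    return majority / minority


# Toy example: 95 negative instances and 5 positive instances.
labels = [0] * 95 + [1] * 5
print(imbalanced_rate(labels))  # 19.0
```

An IR of 19 in this toy case corresponds to a minority class that makes up 5% of the data, which matches the upper frequency bound for rare events quoted in the text.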