An improved deep forest for alleviating the data imbalance problem

METHODOLOGIES AND APPLICATION

Jie Gao¹ · Kunhong Liu¹ · Beizhan Wang¹ · Dong Wang² · Qingqi Hong¹

© Springer-Verlag GmbH Germany, part of Springer Nature 2020

Abstract

Most deep learning methods have inherent defects on small-sized imbalanced datasets and are therefore rarely applied to classifying them. On the one hand, data imbalance biases the classification results of the model toward the majority class. On the other hand, limited training data results in over-fitting. Deep forest (DF) is a deep learning model that works well on small-sized datasets, and its performance is highly competitive with that of deep neural networks. In the present study, a variant of the DF called the imbalanced deep forest (IMDF) is proposed to improve the classification performance on the minority class. The study aims to explore the application of deep learning to small-sized imbalanced datasets. The IMDF is a cascade of multiple layers, where each layer is an ensemble of multiple units. The main idea behind the proposed method is to enable each unit of the IMDF to handle imbalanced data, so that the classification results of the entire IMDF are biased toward the minority class. Experiments demonstrate the effectiveness of the proposed method.

Keywords Deep forest · AdaBoost · SMOTE · Imbalanced data

Communicated by V. Loia.

Kunhong Liu: [email protected]
Beizhan Wang: [email protected]

¹ School of Informatics, Xiamen University, Xiamen 361005, People's Republic of China
² State Grid Fujian Electric Power Company, Fuzhou 350003, People's Republic of China

1 Introduction

Rare events are normally defined as unusual patterns and abnormal behaviors that occur at a low frequency of less than 5%, or even less than 0.1% (Maher Maalouf 2011). Moreover, rare events are characterized by under-represented instances and high misclassification costs. Studies show that detecting rare events in real applications, such as network intrusion detection and cancer detection, is a great challenge. For example, intrusions account for only a very small fraction of the total network traffic. Similarly, misclassifying a cancerous patient as a noncancerous one delays timely treatment and may result in a heavy cost (Chawla et al. 2003). Therefore, the nature of these applications requires a high detection rate for rare events.

In the field of data mining, identifying rare events is essentially a binary classification problem on imbalanced data (Branco et al. 2016). It is normally assumed that the class with the smallest size forms the minority class (or the positive class), while the other classes form the majority class (or the negative class). The imbalanced rate (IR) is an indicator normally used to evaluate the skew level of a dataset. It is defined as the ratio of the number of majority-class instances to the number of minority-class instances (Loyola-González et al. 2017). Most conventional ma
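The IR definition above, and the bias toward the majority class that it signals, can be illustrated with a minimal sketch. The function names `imbalanced_rate` and `majority_baseline` and the toy label list are hypothetical and introduced only for illustration; they are not part of the proposed IMDF. The sketch assumes the smallest class is the minority class and that all other classes are pooled into the majority class, as described in the text.

```python
from collections import Counter


def imbalanced_rate(labels):
    """Return (IR, minority_label), where IR = majority count / minority count."""
    counts = Counter(labels)
    if len(counts) < 2:
        raise ValueError("need at least two classes to compute IR")
    # The class with the smallest size is treated as the minority (positive) class.
    minority_label, minority_count = min(counts.items(), key=lambda kv: kv[1])
    # All remaining classes together form the majority (negative) class.
    majority_count = sum(counts.values()) - minority_count
    return majority_count / minority_count, minority_label


def majority_baseline(labels, minority_label):
    """Accuracy and minority recall of a classifier that never predicts the minority."""
    total = len(labels)
    positives = sum(1 for y in labels if y == minority_label)
    accuracy = (total - positives) / total  # every majority instance is correct
    recall = 0 / positives                  # no minority instance is ever detected
    return accuracy, recall


# Hypothetical toy data: 950 normal connections and 50 intrusions (5% rare events).
labels = ["normal"] * 950 + ["intrusion"] * 50
ir, minority = imbalanced_rate(labels)
accuracy, recall = majority_baseline(labels, minority)
print(ir, minority, accuracy, recall)  # 19.0 intrusion 0.95 0.0
```

On this toy data a classifier that ignores the minority class entirely still reaches 95% accuracy while detecting none of the rare events, which is why overall accuracy alone is a poor guide on imbalanced data and why a high detection rate on the minority class is the stated requirement.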
