The classification of imbalanced large data sets based on MapReduce and ensemble of ELM classifiers

ORIGINAL ARTICLE

Junhai Zhai1,2 • Sufang Zhang3 • Chenxi Wang4

Received: 19 September 2015 / Accepted: 8 December 2015
© Springer-Verlag Berlin Heidelberg 2015

Correspondence: Junhai Zhai

1 Key Laboratory of Machine Learning and Computational Intelligence, College of Mathematics and Information Science, Hebei University, Baoding 071002, Hebei, China

2 College of Mathematics, Physics and Information Engineering, Zhejiang Normal University, Jinhua 321004, China

3 Hebei Branch of Meteorological Cadres Training Institute, China Meteorological Administration, Baoding 071000, China

4 College of Computer Science and Technology, Hebei University, Baoding 071002, Hebei, China

Abstract Aiming at effectively classifying imbalanced large data sets with two classes, this paper proposes a novel algorithm that consists of four stages: (1) alternately over-sample p times between positive class instances and negative class instances; (2) construct l balanced data subsets based on the generated positive class instances; (3) train l component classifiers with the extreme learning machine (ELM) algorithm on the l constructed balanced data subsets; (4) integrate the l ELM classifiers with a simple voting approach. Specifically, in the first stage, we first calculate the center of the positive class instances and then sample instance points along the line between the center and each positive class instance. Next, for each instance point in the new positive class, we first find its k nearest neighbors among the negative class instances with MapReduce, and then sample instance points along the lines between the instance and its k nearest negative neighbors. This over-sampling process is repeated p times. In the second stage, we sample l times from the negative class, each time drawing a sample of the same size as the generated positive class. In each round of sampling, we combine the positive class instances with the sampled negative class instances, thus obtaining l balanced data subsets. The experimental results show that the proposed algorithm obtains promising speed-up and scalability, and also outperforms three other ensemble algorithms in G-mean.

Keywords Imbalanced large data sets • MapReduce • Extreme learning machine • Ensemble learning • Majority voting method
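The four stages described in the abstract can be sketched on a single machine as follows. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the abstract does not say how many points are sampled per line, where on the line they fall, or how many hidden nodes each ELM uses, so the uniform interpolation weights, the one-point-per-line choice, and `n_hidden` are assumptions. The brute-force nearest-neighbor search stands in for the MapReduce job, and all function names are hypothetical.

```python
import numpy as np

def oversample_toward_center(X_pos, rng):
    """Stage 1a: one synthetic point on the segment from each positive to the class center."""
    center = X_pos.mean(axis=0)
    t = rng.uniform(0.0, 1.0, size=(len(X_pos), 1))  # assumed: uniform position on the segment
    return X_pos + t * (center - X_pos)

def oversample_toward_negatives(X_pos, X_neg, k, rng):
    """Stage 1b: one synthetic point toward each of the k nearest negatives of each positive."""
    # Brute-force distances; the paper performs this search with MapReduce.
    d = np.linalg.norm(X_pos[:, None, :] - X_neg[None, :, :], axis=2)
    nn = np.argsort(d, axis=1)[:, :k]                   # indices of the k nearest negatives
    t = rng.uniform(0.0, 0.5, size=(len(X_pos), k, 1))  # assumed: stay nearer the positive end
    pts = X_pos[:, None, :] + t * (X_neg[nn] - X_pos[:, None, :])
    return pts.reshape(-1, X_pos.shape[1])

def train_elm(X, y, n_hidden, rng):
    """Basic ELM: random hidden layer, output weights from the pseudoinverse."""
    W = rng.normal(size=(X.shape[1], n_hidden))
    b = rng.normal(size=n_hidden)
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))              # sigmoid hidden-layer output
    beta = np.linalg.pinv(H) @ np.where(y == 1, 1.0, -1.0)
    return W, b, beta

def predict_elm(model, X):
    W, b, beta = model
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))
    return (H @ beta > 0).astype(int)

def fit_ensemble(X_pos, X_neg, p, k, l, n_hidden, rng):
    P = X_pos
    for _ in range(p):                                  # stage 1: alternate the two samplers p times
        P = np.vstack([P, oversample_toward_center(P, rng)])
        P = np.vstack([P, oversample_toward_negatives(P, X_neg, k, rng)])
    models = []
    for _ in range(l):                                  # stages 2 and 3: l balanced subsets, l ELMs
        idx = rng.choice(len(X_neg), size=len(P), replace=len(P) > len(X_neg))
        X = np.vstack([P, X_neg[idx]])
        y = np.concatenate([np.ones(len(P), dtype=int), np.zeros(len(P), dtype=int)])
        models.append(train_elm(X, y, n_hidden, rng))
    return models

def predict_ensemble(models, X):
    """Stage 4: simple majority vote over the component classifiers."""
    votes = np.stack([predict_elm(m, X) for m in models])
    return (2 * votes.sum(axis=0) > len(models)).astype(int)

def g_mean(y_true, y_pred):
    """Geometric mean of the true positive rate and the true negative rate."""
    tpr = np.mean(y_pred[y_true == 1] == 1)
    tnr = np.mean(y_pred[y_true == 0] == 0)
    return float(np.sqrt(tpr * tnr))

# Toy data (made up for illustration): 20 positives against 1000 negatives.
rng = np.random.default_rng(0)
X_pos = rng.normal(3.0, 1.0, size=(20, 2))
X_neg = rng.normal(0.0, 1.0, size=(1000, 2))
models = fit_ensemble(X_pos, X_neg, p=1, k=2, l=5, n_hidden=20, rng=rng)
X_test = np.vstack([X_pos, X_neg])
y_test = np.concatenate([np.ones(20, dtype=int), np.zeros(1000, dtype=int)])
print("G-mean:", g_mean(y_test, predict_ensemble(models, X_test)))
```

In this sketch the positive class grows by a factor of 2(1 + k) per over-sampling round, so p and k should be kept small relative to the size of the negative class. An odd l avoids ties in the vote.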

1 Introduction

A two-class data set is imbalanced when one of the classes (the minority one, usually called the positive class in the literature) is heavily under-represented relative to the other class (the majority one, usually called the negative class) [1]. Paradoxically, the minority class is often the more important one and usually carries the higher misclassification cost. The classification of imbalanced data sets has recently become a hot research topic in machine learning and data mining, because many imbalanced data sets arise in practical application fields such as medical diagnosis, credit card fraud detection, and network intrusion detection. It is timely and very meaningful i
