The classification of imbalanced large data sets based on MapReduce and ensemble of ELM classifiers
ORIGINAL ARTICLE
Junhai Zhai1,2 • Sufang Zhang3 • Chenxi Wang4
Received: 19 September 2015 / Accepted: 8 December 2015 / © Springer-Verlag Berlin Heidelberg 2015
✉ Junhai Zhai [email protected]
1 Key Laboratory of Machine Learning and Computational Intelligence, College of Mathematics and Information Science, Hebei University, Baoding 071002, Hebei, China
2 College of Mathematics, Physics and Information Engineering, Zhejiang Normal University, Jinhua 321004, China
3 Hebei Branch of Meteorological Cadres Training Institute, China Meteorological Administration, Baoding 071000, China
4 College of Computer Science and Technology, Hebei University, Baoding 071002, Hebei, China
Abstract To classify imbalanced large two-class data sets effectively, this paper proposes a novel algorithm consisting of four stages: (1) alternately over-sample p times between positive class instances and negative class instances; (2) construct l balanced data subsets based on the generated positive class instances; (3) train l component classifiers with the extreme learning machine (ELM) algorithm on the l balanced data subsets; (4) integrate the l ELM classifiers by simple voting. Specifically, in the first stage, we first calculate the center of the positive class instances and then sample instance points along the line between the center and each positive class instance. Next, for each instance point in the new positive class, we find its k nearest neighbors among the negative class instances with MapReduce, and then sample instance points along the line between the instance and its k nearest negative neighbors. This over-sampling process is repeated p times. In the second stage, we sample l times from the negative class, each time drawing a sample of the same size as the set of generated positive class instances. In each round of sampling, we put the positive class and negative class instances together and thus obtain l balanced data subsets. The experimental results show that the proposed algorithm achieves promising speed-up and scalability, and also outperforms three other ensemble algorithms in G-mean.
Keywords Imbalanced large data sets · MapReduce · Extreme learning machine · Ensemble learning · Majority voting method
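The four stages summarized in the abstract can be illustrated with a minimal single-machine Python sketch. It is not the authors' implementation and rests on several assumptions: the nearest-neighbor search is brute force rather than a MapReduce job, synthetic points are placed at a uniformly random position on each line segment, labels are coded as +1 (positive) and -1 (negative), and the ELM is the basic form with a random sigmoid hidden layer and pseudoinverse output weights. All function names and parameters (`oversample_round`, `train_ensemble`, `n_hidden`, `seed`) are hypothetical. G-mean is computed as the geometric mean of the true positive rate and the true negative rate.

```python
import numpy as np

def oversample_round(pos, neg, k, rng):
    """One over-sampling round: center-based step, then negative-neighbor step."""
    # Step A: one synthetic point on the segment from the positive-class
    # center to each positive instance (uniform position is an assumption).
    center = pos.mean(axis=0)
    t = rng.random((len(pos), 1))
    new_pos = np.vstack([pos, center + t * (pos - center)])
    # Step B: brute-force k nearest negative neighbors of every positive
    # point (a stand-in for the MapReduce search described in the abstract).
    dists = np.linalg.norm(new_pos[:, None, :] - neg[None, :, :], axis=2)
    nn_idx = np.argsort(dists, axis=1)[:, :k]            # (n_pos, k)
    neighbors = neg[nn_idx]                              # (n_pos, k, d)
    t = rng.random((len(new_pos), k, 1))
    synthetic = new_pos[:, None, :] + t * (neighbors - new_pos[:, None, :])
    return np.vstack([new_pos, synthetic.reshape(-1, pos.shape[1])])

def train_elm(X, y, n_hidden, rng):
    """Basic ELM: random hidden layer, least-squares output weights."""
    W = rng.standard_normal((X.shape[1], n_hidden))
    b = rng.standard_normal(n_hidden)
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))               # sigmoid hidden outputs
    beta = np.linalg.pinv(H) @ y                         # y holds +1 / -1 labels
    return W, b, beta

def predict_elm(model, X):
    W, b, beta = model
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))
    return np.where(H @ beta >= 0, 1, -1)

def train_ensemble(pos, neg, p, k, l, n_hidden, seed=0):
    """Stages 1-3: over-sample p times, build l balanced subsets, train l ELMs."""
    rng = np.random.default_rng(seed)
    for _ in range(p):
        pos = oversample_round(pos, neg, k, rng)
    models = []
    for _ in range(l):
        # Draw as many negatives as there are generated positives; fall back
        # to sampling with replacement if the negatives are too few.
        idx = rng.choice(len(neg), size=len(pos), replace=len(pos) > len(neg))
        X = np.vstack([pos, neg[idx]])
        y = np.concatenate([np.ones(len(pos)), -np.ones(len(idx))])
        models.append(train_elm(X, y, n_hidden, rng))
    return models

def predict_ensemble(models, X):
    """Stage 4: simple majority vote; a tie is assigned to the negative class."""
    votes = np.stack([predict_elm(m, X) for m in models])
    return np.where(votes.sum(axis=0) > 0, 1, -1)

def g_mean(y_true, y_pred):
    """Geometric mean of true positive rate and true negative rate."""
    tpr = np.mean(y_pred[y_true == 1] == 1)
    tnr = np.mean(y_pred[y_true == -1] == -1)
    return float(np.sqrt(tpr * tnr))
```

In this sketch each round grows the positive class by a factor of 2(1 + k), so p and k should stay small on a single machine. An odd l avoids ties in the vote.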
1 Introduction
A two-class data set is imbalanced when one of the classes (the minority one, usually called the positive class in the literature) is heavily under-represented relative to the other class (the majority one, usually called the negative class) [1]. Paradoxically, the minority class is often the more important one and usually the one with the higher misclassification cost. Recently, the classification of imbalanced data sets has become a hot research topic in machine learning and data mining, because many imbalanced data sets arise in practical application fields such as medical diagnosis, credit card fraud detection, and network intrusion detection. It is timely and very meaningful i