GP Classification under Imbalanced Data sets: Active Sub-sampling and AUC Approximation



Abstract. The problem of evolving binary classification models under increasingly unbalanced data sets is approached by proposing a strategy consisting of two components: sub-sampling and 'robust' fitness function design. In particular, recent work in the wider machine learning literature has recognized that maintaining the original distribution of exemplars during training is often not appropriate for designing classifiers that are robust to degenerate classifier behavior. To this end we propose a 'Simple Active Learning Heuristic' (SALH) in which a subset of exemplars is sampled with uniform probability under a class-balance enforcing rule for fitness evaluation. In addition, an efficient estimator for the Area Under the Curve (AUC) performance metric is adopted in the form of a modified Wilcoxon-Mann-Whitney (WMW) statistic. Performance is evaluated on six representative UCI data sets and benchmarked against canonical GP, SALH-based GP, GP with both SALH and the modified WMW statistic, and deterministic classifiers (Naive Bayes and C4.5). The resulting SALH-WMW model is demonstrated to be both efficient and effective at providing solutions maximizing performance assessed in terms of AUC.

1 Introduction

Genetic Programming (GP) provides many unique opportunities for posing solutions to the basic Machine Learning design questions of representation, cost function, and credit assignment. In this work we are specifically interested in cost function design under the classification domain of supervised learning. Classically, an equally weighted cost function is assumed, such as 'hits' [11] or sum square error [2]. Such a design choice might be natural under balanced binary classification problems where each class carries equal risk, but is questionable in the wider context of real-world data sets, which are frequently unbalanced. At the very least, as the class distribution becomes increasingly unbalanced, the likelihood of evolving degenerate classifier behavior increases [6], [19]. Addressing the class imbalance problem can be approached from at least two related perspectives: identification of an appropriate cost (fitness) function, and sampling the original distribution of training exemplars such that the learning algorithm adapts under a different distribution than the original data set.

M. O'Neill et al. (Eds.): EuroGP 2008, LNCS 4971, pp. 266–277, 2008. © Springer-Verlag Berlin Heidelberg 2008
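The Wilcoxon-Mann-Whitney statistic referenced above estimates AUC directly as the fraction of (positive, negative) exemplar pairs that a classifier's output ranks correctly, which avoids the degenerate single-class solutions that an equally weighted cost function can reward. A minimal sketch of the basic (unmodified) WMW estimator follows; the function and variable names are illustrative and not taken from the paper:

```python
def wmw_auc(scores, labels):
    """Wilcoxon-Mann-Whitney estimate of AUC.

    scores: real-valued classifier outputs, one per exemplar.
    labels: 1 for the positive (minor) class, 0 for the negative class.
    Returns the fraction of positive/negative pairs ranked correctly,
    counting ties as half-correct.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    correct = sum(1.0 if p > n else 0.5 if p == n else 0.0
                  for p in pos for n in neg)
    return correct / (len(pos) * len(neg))
```

Note that the estimator depends only on the ranking induced by the scores, not on any classification threshold, so it is insensitive to the class imbalance of the evaluation sample.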


In the case of sampling algorithms, several paradigms have appeared, including: (1) boosting and bagging algorithms, which tend to result in multiple individuals being built relative to static resampling of the original training data; and (2) active learning or sub-sampling algorithms, which may identify a sub-sample of exemplars from the larger training data set at each training cycle. The latter case is of interest in this work. In particular, we begin with the observation made by Weiss and Provost (under decision tree induction) [20]; that is, robust classifiers may be built relative to the po
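The class-balance enforcing rule used by sub-sampling heuristics of this kind can be sketched as a uniform draw of an equal number of exemplars from each class per training cycle. The following is an assumed minimal form, not the paper's exact SALH implementation; names and the half-and-half split are illustrative:

```python
import random

def balanced_subsample(exemplars, labels, subset_size, rng=random):
    """Draw subset_size exemplars for one fitness evaluation cycle,
    half from each class, each chosen with uniform probability
    within its class (class-balance enforcing rule)."""
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    half = subset_size // 2
    idx = (rng.sample(pos, min(half, len(pos)))
           + rng.sample(neg, min(half, len(neg))))
    rng.shuffle(idx)
    return [(exemplars[i], labels[i]) for i in idx]
```

Because a fresh balanced subset can be drawn at each cycle, the evolving population is exposed to the full training set over time while every single fitness evaluation sees an equal class prior.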