When is Undersampling Effective in Unbalanced Classification Tasks?
Machine Learning Group (MLG), Computer Science Department, Faculty of Sciences ULB, Université Libre de Bruxelles, Brussels, Belgium ([email protected])
Fraud Risk Management Analytics, Worldline, Brussels, Belgium
Interuniversity Institute of Bioinformatics in Brussels (IB)², Brussels, Belgium
Abstract. A well-known rule of thumb in unbalanced classification recommends rebalancing the classes (typically by resampling) before learning the classifier. Though this seems to work in the majority of cases, no detailed analysis exists about the impact of undersampling on the accuracy of the final classifier. This paper aims to fill this gap by proposing an integrated analysis of the two elements that have the largest impact on the effectiveness of an undersampling strategy: the increase in variance due to the reduction of the number of samples, and the warping of the posterior distribution due to the change of the a priori probabilities. In particular, we propose a theoretical analysis specifying under which conditions undersampling is recommended and expected to be effective. It emerges that the impact of undersampling depends on the number of samples, the variance of the classifier, the degree of imbalance and, more specifically, on the value of the posterior probability. This makes it difficult to predict the average effectiveness of an undersampling strategy, since its benefits depend on the distribution of the testing points. Results on several synthetic and real-world unbalanced datasets support and validate our findings.

Keywords: Undersampling · Ranking · Class overlap · Unbalanced classification
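As a rough illustration of the posterior warping mentioned in the abstract, the sketch below (not the paper's code; function names are ours) applies the standard relation that holds when all minority (positive) samples are kept and each majority (negative) sample is retained with probability β: the posterior seen on the undersampled data becomes p_s = p / (p + β(1 − p)).

```python
import numpy as np

def warped_posterior(p, beta):
    # Posterior seen after undersampling: all positives kept, each
    # negative retained with probability beta (0 < beta <= 1).
    # Bayes' rule with the negative prior scaled by beta gives
    #   p_s = p / (p + beta * (1 - p)).
    p = np.asarray(p, dtype=float)
    return p / (p + beta * (1.0 - p))

def unwarp_posterior(ps, beta):
    # Inverse map: recover the original posterior p from the
    # probability ps estimated on the undersampled data.
    ps = np.asarray(ps, dtype=float)
    return beta * ps / (beta * ps - ps + 1.0)

# A point with a 2% true posterior, majority class retained at 5%:
p, beta = 0.02, 0.05
ps = warped_posterior(p, beta)          # ~0.29: strongly inflated
print(ps, unwarp_posterior(ps, beta))   # inverse recovers ~0.02
```

Note how strongly the warping depends on the value of p itself, which is one reason the paper's analysis singles out the posterior probability as a key factor.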
1 Introduction
In several binary classification problems, the two classes are not equally represented in the dataset. For example, in fraud detection, fraudulent transactions are normally outnumbered by genuine ones [5]. When one class is underrepresented in a dataset, the data is said to be unbalanced. In such problems the minority class is typically the class of interest. Having few instances of one class means that the learning algorithm is often unable to generalize the behavior of the minority class well, and hence performs poorly in terms of predictive accuracy [14]. When the data is unbalanced, standard machine learning algorithms that maximise overall accuracy tend to classify all observations as majority class instances. This translates into poor accuracy on the minority class (low recall), which is typically the class of interest. Degradation of classification performance is related not only to a small number of examples in the minority class relative to the number of examples in the majority class (expressed by the class imbalance ratio), but also to the decomposition of the minority class into small sub-parts [19] (also known in the literature as small disjuncts).
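To make the accuracy pitfall concrete, the following small snippet (an illustration with hypothetical numbers, not an experiment from the paper) shows that a degenerate classifier which always predicts the majority class attains near-perfect accuracy on a 99:1 dataset while catching no minority instance at all.

```python
import numpy as np

rng = np.random.default_rng(0)
y = (rng.random(10_000) < 0.01).astype(int)  # ~1% minority class

y_pred = np.zeros_like(y)  # degenerate classifier: always majority

accuracy = (y_pred == y).mean()    # ~0.99: looks excellent
recall = y_pred[y == 1].mean()     # 0.0: not a single minority case caught
print(f"accuracy={accuracy:.3f}, recall={recall:.3f}")
```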