Towards Effective Classification of Imbalanced Data with Convolutional Neural Networks
Abstract. Class imbalance is a problem often found in real-world machine learning data, where examples from one class clearly dominate the dataset. When class-to-class separability is poor, most neural network classifiers fail to learn such datasets correctly because they develop a strong bias towards the majority class. In this paper we present an algorithmic solution that integrates different methods into a novel approach using a class-to-class separability score, in order to increase the performance of Cost Sensitive Neural Networks on poorly separable, imbalanced datasets. We compare different cost functions and methods that can be used for training Convolutional Neural Networks on a highly imbalanced dataset of multi-channel time series data. Results show that, even though the data are imbalanced and poorly separable, cost sensitive Convolutional Neural Networks reach a G-Mean as high as 92.8 % when detecting patterns and classifying time series from 3 different datasets.
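To illustrate why a metric such as G-Mean is more informative than accuracy on imbalanced data, the following minimal Python sketch may help. It assumes the common binary definition of G-Mean as the geometric mean of sensitivity and specificity; the abstract does not spell out the formula, so this definition is an assumption. The function names `g_mean` and `accuracy` and the toy labels are hypothetical and are not taken from the paper.

```python
import math


def accuracy(y_true, y_pred):
    """Fraction of examples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def g_mean(y_true, y_pred, positive=1):
    """Geometric mean of sensitivity and specificity (assumed binary definition)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    # Guard against empty classes so the sketch never divides by zero.
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return math.sqrt(sensitivity * specificity)


# Hypothetical toy dataset: 95 majority examples (0) and 5 minority examples (1).
y_true = [0] * 95 + [1] * 5

# A classifier biased towards the majority class predicts 0 for everything.
always_majority = [0] * 100

# A less biased classifier: 10 false alarms on the majority class,
# 4 of the 5 minority examples detected.
less_biased = [0] * 85 + [1] * 10 + [1] * 4 + [0] * 1

for name, y_pred in [("always majority", always_majority),
                     ("less biased", less_biased)]:
    print(f"{name}: accuracy={accuracy(y_true, y_pred):.3f}, "
          f"g_mean={g_mean(y_true, y_pred):.3f}")
```

In this toy setting the majority-only classifier scores higher on accuracy but has a G-Mean of zero, because it never identifies a minority example, whereas the less biased classifier trades a little accuracy for a much higher G-Mean.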
1 Introduction
In supervised classification tasks, effective learning happens when there are sufficient examples for all classes and the class-to-class (C2C) separability is sufficiently large. However, real-world datasets are often imbalanced and have poor C2C separability. A dataset is said to be imbalanced when a certain class is overrepresented compared to the other classes in that dataset. In binary classification tasks, the overrepresented class is referred to as the majority class and the other as the minority class. Machine learning algorithms performing classification on such datasets face the so-called ‘class imbalance problem’, where learning is not as effective as it is with a balanced dataset [6,10,13], because the imbalance biases learning towards the majority class. On the one hand, many real-world datasets are imbalanced; on the other hand, most existing classification approaches assume that the underlying training set is evenly distributed. Furthermore, in many scenarios it is undesirable or dangerous to misclassify an example from a minority class. For example, in a continuous surveillance task, suspicious activity may occur as a rare event that must not go unnoticed by the monitoring system. In medical applications, erroneously classifying a sick person as healthy can carry a larger risk (cost) than wrongly classifying a healthy person as sick. In these cases it is crucial for classification algorithms to have a higher identification rate for rare events: it is critical not to misclassify any minority examples, while it is acceptable to misclassify a few majority examples. An extreme example of the imbalance problem would be a dataset where the area of the majority class overlaps that of the minority class completely and the overlapping region contains as many

© Springer International Publishing AG 2016. F. Schwenker et al. (Eds.): ANNPR 2016, LNAI 9896, pp. 150–162, 2016. DOI: 10.1007/978-3-319-46182-3_13