New Function for Estimating Imbalanced Data Classification Results


V. V. Starovoitov* and Yu. I. Golub**

United Institute of Informatics Problems, National Academy of Sciences of Belarus, Minsk, 220012 Belarus
*e-mail: [email protected]
**e-mail: [email protected]

Abstract: In this paper, we propose a new function for estimating the quality of classification into N classes. This function is invariant to the imbalance of classes to be processed. It is constructed by computing the sine of an angle formed by the errors of each class in an N-dimensional space. A geometrical substantiation of its construction is provided and its properties are investigated. It is shown that this function is an improved version of the balanced accuracy function. In contrast to other functions, the proposed function considers class distribution of errors. Examples of analyzing the confusion matrices in the classification of synthetic and real-world data are provided.

Keywords: classification of imbalanced data, confusion matrix, classification accuracy functions

DOI: 10.1134/S105466182003027X

1. INTRODUCTION

Confusion matrices are often employed to estimate the results of classification of imbalanced data. The following estimation functions are evaluated based on these matrices: area under the ROC curve (232 or 44.00%), accuracy (201 or 38.14%), geometric mean (G-mean) (156 or 29.60%), harmonic mean (F-score) (144 or 27.32%), sensitivity (83 or 15.74%), specificity (69 or 13.09%), precision (63 or 11.95%), balanced accuracy (8 or 1.15%), and Matthews correlation coefficient (6 or 1.14%). Here, the parentheses give the number and share of papers in which each function was used, out of the 527 papers published in 192 journals during 2006–2016 and described in [1]. All of them, except the last one, are averaged generalizations of the same-name functions for estimating the quality of binary classification. The Matthews function is a discrete version of the Pearson correlation coefficient.

In [2], a comparative analysis of quality estimates for binary classification was carried out. It was shown that, when the quality of binary classification is estimated from confusion matrices, the area under the ROC curve and balanced accuracy coincide and are the best options for estimating the classification of imbalanced data.

In this paper, we propose a new function for estimating the quality of multiclass classification that, in contrast to the well-known functions, is invariant to the imbalance of data to be classified and takes into account the spread of misclassified instances.

In practice, data to be classified are not class balanced. For instance, the number of patients with cancer or tuberculosis is significantly lower than the number of healthy people; the number of insects exceeds the number of birds, while the number of mammals is even smaller. However, when developing algorithms for classifying people into healthy and sick (while determining the stage of a disease) or classifying images of animals, given an imbalanced training
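As a minimal sketch of why overall accuracy can mislead on imbalanced data while balanced accuracy does not, consider a healthy/sick setting of the kind mentioned in this introduction. The counts below are hypothetical, the function names are illustrative rather than taken from the paper, and the rows of the confusion matrix are assumed to correspond to true classes. The last helper checks numerically the coincidence of balanced accuracy and the area under the ROC curve noted for [2], under the assumption that the ROC curve of a hard binary classifier is the polyline through its single operating point.

```python
import numpy as np

def accuracy(cm):
    """Overall accuracy: correctly classified instances over all instances."""
    cm = np.asarray(cm, dtype=float)
    return np.trace(cm) / cm.sum()

def balanced_accuracy(cm):
    """Mean per-class recall (rows are assumed to be true classes)."""
    cm = np.asarray(cm, dtype=float)
    recalls = np.diag(cm) / cm.sum(axis=1)
    return recalls.mean()

def single_point_auc(cm):
    """Area under the ROC polyline (0,0) -> (FPR,TPR) -> (1,1) for a 2x2 matrix.

    Class 0 is treated as negative and class 1 as positive (an assumption).
    """
    cm = np.asarray(cm, dtype=float)
    tnr = cm[0, 0] / cm[0].sum()
    tpr = cm[1, 1] / cm[1].sum()
    fpr = 1.0 - tnr
    # Triangle under the first segment plus trapezoid under the second.
    return 0.5 * fpr * tpr + 0.5 * (1.0 - fpr) * (tpr + 1.0)

# Hypothetical data: 990 healthy and 10 sick people.
# A trivial classifier that labels everyone as healthy:
trivial = [[990, 0],
           [10, 0]]
print(accuracy(trivial), balanced_accuracy(trivial))  # accuracy 0.99, balanced accuracy 0.5

# A classifier that actually detects most of the sick people:
useful = [[900, 90],
          [2, 8]]
print(accuracy(useful), balanced_accuracy(useful), single_point_auc(useful))
```

The trivial classifier scores higher on accuracy than the useful one, whereas balanced accuracy ranks them the other way round; for the 2x2 case the single-point area equals (TPR + TNR)/2, which is balanced accuracy.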
