New Function for Estimating Imbalanced Data Classification Results

V. V. Starovoitov a,* and Yu. I. Golub a,**

a United Institute of Informatics Problems, National Academy of Sciences of Belarus, Minsk, 220012 Belarus
*e-mail: [email protected]
**e-mail: [email protected]

Abstract: In this paper, we propose a new function for estimating the quality of classification into N classes. This function is invariant to the imbalance of the classes being processed. It is constructed by computing the sine of an angle formed by the errors of each class in an N-dimensional space. A geometrical substantiation of its construction is provided, and its properties are investigated. It is shown that this function is an improved version of the balanced accuracy function. In contrast to other functions, the proposed function takes into account how errors are distributed across classes. Examples of analyzing confusion matrices in the classification of synthetic and real-world data are provided.

Keywords: classification of imbalanced data, confusion matrix, classification accuracy functions

DOI: 10.1134/S105466182003027X

1. INTRODUCTION

Confusion matrices are often used to assess the results of classifying imbalanced data. The following estimation functions are computed from these matrices: area under the ROC curve (232 or 44.00%), accuracy (201 or 38.14%), geometric mean (G-mean) (156 or 29.60%), harmonic mean (F-score) (144 or 27.32%), sensitivity (83 or 15.74%), specificity (69 or 13.09%), precision (63 or 11.95%), balanced accuracy (8 or 1.15%), and Matthews correlation coefficient (6 or 1.14%). Here, the parentheses give the number and share of papers in which each function was used, out of the 527 papers published in 192 journals during 2006–2016 and described in [1]. All of these functions except the last are averaged generalizations of the functions of the same name used to estimate the quality of binary classification. The Matthews function is a discrete version of the Pearson correlation coefficient.

In [2], a comparative analysis of quality estimates for binary classification was carried out. It was shown that, when binary classification is assessed from confusion matrices, the area under the ROC curve and balanced accuracy coincide and are the best options for estimating the classification of imbalanced data.

In this paper, we propose a new function for estimating the quality of multiclass classification that, in contrast to the well-known functions, is invariant to the imbalance of the data to be classified and takes into account the spread of misclassified instances.

In practice, data to be classified are typically not class balanced. For instance, the number of patients with cancer or tuberculosis is significantly lower than the number of healthy people; the number of insects exceeds the number of birds, while the number of mammals is even smaller. However, when developing algorithms for classifying people into healthy and sick (while determining the stage of a disease) or classifying images of animals, given an imbalanced training