


An apparent paradox: a classifier based on a partially classified sample may have smaller expected error rate than that if the sample were completely classified

Daniel Ahfock¹ · Geoffrey J. McLachlan¹

Received: 14 January 2020 / Accepted: 27 August 2020
© Springer Science+Business Media, LLC, part of Springer Nature 2020

Abstract

There has been increasing interest in using semi-supervised learning to form a classifier. As is well known, the (Fisher) information in an unclassified feature with unknown class label is less (considerably less for weakly separated classes) than that of a classified feature which has known class label. Hence in the case where the absence of class labels does not depend on the data, the expected error rate of a classifier formed from the classified and unclassified features in a partially classified sample is greater than that if the sample were completely classified. We propose to treat the labels of the unclassified features as missing data and to introduce a framework for their missingness as in the pioneering work of Rubin (Biometrika 63:581–592, 1976) for missingness in incomplete data analysis. An examination of several partially classified data sets in the literature suggests that the unclassified features are not occurring at random in the feature space, but rather tend to be concentrated in regions of relatively high entropy. It suggests that the missingness of the labels of the features can be modelled by representing the conditional probability of a missing label for a feature via the logistic model with covariate depending on the entropy of the feature or an appropriate proxy for it. We consider here the case of two normal classes with a common covariance matrix where for computational convenience the square of the discriminant function is used as the covariate in the logistic model in place of the negative log entropy. Rather paradoxically, we show that the classifier so formed from the partially classified sample may have smaller expected error rate than that if the sample were completely classified.

Keywords Normal discrimination · Semi-supervised learning · Model for missing-class labels · Relative efficiency of classifiers
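The missingness mechanism described in the abstract can be illustrated with a small simulation. The following is only a minimal sketch under stated assumptions, not the authors' implementation: it uses a univariate feature, equal class proportions, and hypothetical names and values (`xi0`, `xi1`, `mu1`, `mu2`, `sigma2`) that are not given in the text above. The label of a feature is made missing with a logistic probability whose covariate is the squared discriminant function, and a negative slope is assumed so that labels are more likely to be missing near the decision boundary, where the entropy is high.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def discriminant(y, mu1, mu2, sigma2, pi1=0.5):
    """Log-odds of class 1 versus class 2 for two normal classes
    with a common variance (linear in y)."""
    beta1 = (mu1 - mu2) / sigma2
    beta0 = math.log(pi1 / (1.0 - pi1)) - (mu1 ** 2 - mu2 ** 2) / (2.0 * sigma2)
    return beta0 + beta1 * y


def entropy(d):
    """Shannon entropy of the two-class posterior whose log-odds is d."""
    tau1 = sigmoid(d)
    return -sum(t * math.log(t) for t in (tau1, 1.0 - tau1) if t > 0.0)


def missing_prob(d, xi0, xi1):
    """Assumed logistic model for P(label missing | y), covariate d(y)^2."""
    return sigmoid(xi0 + xi1 * d * d)


def simulate(n, mu1=1.0, mu2=-1.0, sigma2=1.0, xi0=1.0, xi1=-2.0, seed=0):
    """Draw n features and flag each label as missing or observed.
    All parameter values are illustrative assumptions."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        label = 1 if rng.random() < 0.5 else 2
        y = rng.gauss(mu1 if label == 1 else mu2, math.sqrt(sigma2))
        d = discriminant(y, mu1, mu2, sigma2)
        missing = rng.random() < missing_prob(d, xi0, xi1)
        rows.append((y, label, missing, entropy(d)))
    return rows


rows = simulate(5000)
ent_missing = [e for _, _, m, e in rows if m]
ent_observed = [e for _, _, m, e in rows if not m]
mean_ent_missing = sum(ent_missing) / len(ent_missing)
mean_ent_observed = sum(ent_observed) / len(ent_observed)
print(mean_ent_missing, mean_ent_observed)
```

Under these assumed settings, the features with missing labels have a higher average entropy than those with observed labels, which mirrors the pattern the abstract reports for the partially classified data sets in the literature.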

This research was funded by the Australian Government through the Australian Research Council (Project Numbers DP170100907 and IC170100035).

Electronic supplementary material The online version of this article (https://doi.org/10.1007/s11222-020-09971-5) contains supplementary material, which is available to authorized users.

Geoffrey J. McLachlan [email protected]
Daniel Ahfock [email protected]

¹ School of Mathematics and Physics, University of Queensland, St. Lucia, QLD 4072, Australia

1 Introduction

We consider the problem of forming a classifier from training data that are not completely classified. That is, the feature vectors in the training sample have all been observed, but their class labels are missing for some of them and so the training data constitute a partially classified sample. This problem goes back at least to the mid-se