A Non-intrusive Correction Algorithm for Classification Problems with Corrupted Data



Jun Hou¹ · Tong Qin¹ · Kailiang Wu¹ · Dongbin Xiu¹

Received: 10 February 2020 / Revised: 2 June 2020 / Accepted: 13 June 2020
© Shanghai University 2020

Abstract
A novel correction algorithm is proposed for multi-class classification problems with corrupted training data. The algorithm is non-intrusive, in the sense that it post-processes a trained classification model by adding a correction procedure to the model prediction. The correction procedure can be coupled with any approximator, such as logistic regression, neural networks of various architectures, etc. When the training dataset is sufficiently large, we prove theoretically (in the limiting case) and show numerically that the corrected models deliver correct classification results as if there were no corruption in the training data. For datasets of finite size, the corrected models produce significantly better results than the uncorrected models. All of the theoretical findings in the paper are verified by our numerical examples.

Keywords Data corruption · Deep neural network · Cross-entropy · Label corruption · Robust loss

Mathematics Subject Classification 62-08 · 68P30 · 68R01
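To make the "non-intrusive" idea concrete, the following is a minimal, hypothetical sketch (not the algorithm of this paper): the trained classifier is treated as a black box, and only its predicted class probabilities are post-processed. Here the correction is assumed to take the common transition-matrix form, where `T[i, j]` is the probability that a sample whose true class is `i` received the corrupted label `j`; the function and variable names are illustrative.

```python
import numpy as np

def corrected_prediction(predict_proba, T, x):
    """Post-process black-box class probabilities by undoing the label
    noise described by transition matrix T (assumed known and invertible)."""
    p_noisy = predict_proba(x)               # probabilities learned from corrupted labels
    p_clean = np.linalg.solve(T.T, p_noisy)  # invert the mixing: p_noisy = T^T p_clean
    p_clean = np.clip(p_clean, 0.0, None)    # guard against small negative entries
    return p_clean / p_clean.sum()           # renormalize to a probability vector

# Toy example: 3 classes with 20% symmetric label noise.
rho, K = 0.2, 3
T = (1 - rho) * np.eye(K) + rho / (K - 1) * (np.ones((K, K)) - np.eye(K))

# A stand-in "trained model" whose output reflects the corrupted labels.
model = lambda x: T.T @ np.array([0.7, 0.2, 0.1])
print(corrected_prediction(model, T, None))  # recovers [0.7, 0.2, 0.1]
```

Because the correction touches only the prediction step, it can be wrapped around any already-trained approximator without retraining, which is the sense of "non-intrusive" used in the abstract.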

1 Introduction

Classification problems arise in many practical applications, such as image classification, speech recognition, spam filtering, and so on. Over the past decades, classification has been widely studied using machine learning techniques, which seek to learn a classifier from a labeled training dataset to predict class labels for new data. However, real-world datasets often contain noise, and their class labels can be corrupted, i.e., mislabelled.

* Dongbin Xiu [email protected] · Jun Hou [email protected] · Tong Qin [email protected] · Kailiang Wu [email protected]



Department of Mathematics, The Ohio State University, Columbus, OH 43210, USA





Communications on Applied Mathematics and Computation

This can be caused by a variety of reasons, including human error, measurement error, subjective bias of labelers, etc. Label corruptions also occur in data poisoning [17, 31]. For a more comprehensive review of the sources of label corruptions, see Section B of [5]. Label corruptions, natural or malicious, can adversely impact the classification performance of classifiers. See, for example, [27, 38, 40] for impacts on different machine learning techniques. It is, therefore, important to explore robust techniques that can mitigate, or even eliminate, the consequences of label corruptions.
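A standard way to study such corruptions empirically is to inject them synthetically. The sketch below simulates uniform (symmetric) label noise: each training label is flipped, with probability `rho`, to a different class chosen uniformly at random. The function name and parameters are illustrative, not from the paper.

```python
import numpy as np

def corrupt_labels(y, num_classes, rho, seed=0):
    """Flip each label to a different class with probability rho."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y).copy()
    flip = rng.random(y.shape[0]) < rho          # which samples to corrupt
    offsets = rng.integers(1, num_classes, size=flip.sum())
    y[flip] = (y[flip] + offsets) % num_classes  # guaranteed different class
    return y

y_clean = np.array([0, 1, 2, 0, 1, 2, 0, 1])
y_noisy = corrupt_labels(y_clean, num_classes=3, rho=0.3)
print((y_noisy != y_clean).mean())  # observed fraction of corrupted labels
```

Corruption rates produced this way give a controlled setting for measuring how much a classifier degrades under mislabelling, and for testing correction procedures against a known ground truth.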

1.1 Related Work

There exists a large body of literature on learning classifiers in the presence of label noise/errors. See, for example, [5] for a detailed survey. Methods to enhance model robustness against label noise include modifying the network architecture and introducing corrections to the loss function [11, 15, 29]. Larsen et al. [15] proposed a framework for designing robust neural network (NN) classifiers by