A Non-intrusive Correction Algorithm for Classification Problems with Corrupted Data



Jun Hou¹ · Tong Qin¹ · Kailiang Wu¹ · Dongbin Xiu¹

Received: 10 February 2020 / Revised: 2 June 2020 / Accepted: 13 June 2020
© Shanghai University 2020

Abstract
A novel correction algorithm is proposed for multi-class classification problems with corrupted training data. The algorithm is non-intrusive, in the sense that it post-processes a trained classification model by adding a correction procedure to the model prediction. The correction procedure can be coupled with any approximator, such as logistic regression, neural networks of various architectures, etc. When the training dataset is sufficiently large, we prove theoretically (in the limiting case) and show numerically that the corrected models deliver correct classification results as if there were no corruption in the training data. For datasets of finite size, the corrected models produce significantly better results than the uncorrected models. All of the theoretical findings in the paper are verified by our numerical examples.

Keywords Data corruption · Deep neural network · Cross-entropy · Label corruption · Robust loss

Mathematics Subject Classification 62-08 · 68P30 · 68R01
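To make the "non-intrusive" idea concrete, the following is a minimal, hypothetical sketch (not the algorithm of this paper): the trained classifier is treated as a black box, and only its predicted class probabilities are post-processed. Here the correction is assumed to take the common transition-matrix form, where `T[i, j]` is the probability that a sample whose true class is `i` received the corrupted label `j`; the function and variable names are illustrative.

```python
import numpy as np

def corrected_prediction(predict_proba, T, x):
    """Post-process black-box class probabilities by undoing the label
    noise described by transition matrix T (assumed known and invertible)."""
    p_noisy = predict_proba(x)               # probabilities learned from corrupted labels
    p_clean = np.linalg.solve(T.T, p_noisy)  # invert the mixing: p_noisy = T^T p_clean
    p_clean = np.clip(p_clean, 0.0, None)    # guard against small negative entries
    return p_clean / p_clean.sum()           # renormalize to a probability vector

# Toy example: 3 classes with 20% symmetric label noise.
rho, K = 0.2, 3
T = (1 - rho) * np.eye(K) + rho / (K - 1) * (np.ones((K, K)) - np.eye(K))

# A stand-in "trained model" whose output reflects the corrupted labels.
model = lambda x: T.T @ np.array([0.7, 0.2, 0.1])
print(corrected_prediction(model, T, None))  # recovers [0.7, 0.2, 0.1]
```

Because the correction touches only the prediction step, it can be wrapped around any already-trained approximator without retraining, which is the sense of "non-intrusive" used in the abstract.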

1 Introduction

Classification problems arise in many practical applications, such as image classification, speech recognition, spam filtering, and so on. Over the past decades, classification has been widely studied using machine learning techniques, which seek to learn a classifier from a labeled training dataset to predict class labels for new data. However, real-world datasets often contain noise, and their class labels can be corrupted, i.e., mislabelled.

* Dongbin Xiu [email protected] · Jun Hou [email protected] · Tong Qin [email protected] · Kailiang Wu [email protected]



Department of Mathematics, The Ohio State University, Columbus, OH 43210, USA





Communications on Applied Mathematics and Computation

This can be caused by a variety of reasons, including human error, measurement error, subjective bias of labelers, etc. Label corruptions also occur in data poisoning [17, 31]. For a more comprehensive review of the sources of label corruptions, see Section B of [5]. Label corruptions, natural or malicious, can adversely impact the classification performance of classifiers. See, for example, [27, 38, 40] for impacts on different machine learning techniques. It is, therefore, important to explore robust techniques that can mitigate, or even eliminate, the consequences of label corruptions.
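A standard way to study such corruptions empirically is to inject them synthetically. The sketch below simulates uniform (symmetric) label noise: each training label is flipped, with probability `rho`, to a different class chosen uniformly at random. The function name and parameters are illustrative, not from the paper.

```python
import numpy as np

def corrupt_labels(y, num_classes, rho, seed=0):
    """Flip each label to a different class with probability rho."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y).copy()
    flip = rng.random(y.shape[0]) < rho          # which samples to corrupt
    offsets = rng.integers(1, num_classes, size=flip.sum())
    y[flip] = (y[flip] + offsets) % num_classes  # guaranteed different class
    return y

y_clean = np.array([0, 1, 2, 0, 1, 2, 0, 1])
y_noisy = corrupt_labels(y_clean, num_classes=3, rho=0.3)
print((y_noisy != y_clean).mean())  # observed fraction of corrupted labels
```

Corruption rates produced this way give a controlled setting for measuring how much a classifier degrades under mislabelling, and for testing correction procedures against a known ground truth.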

1.1 Related Work

There exists a large body of literature on learning classifiers in the presence of label noise/errors. See, for example, [5] for a detailed survey. Methods to enhance model robustness against label noise include modifying the network architecture and introducing corrections to the loss function [11, 15, 29]. Larsen et al. [15] proposed a framework for designing robust neural network (NN) classifiers by