Agreeing to disagree: active learning with noisy labels without crowdsourcing



ORIGINAL ARTICLE

Agreeing to disagree: active learning with noisy labels without crowdsourcing

Mohamed‑Rafik Bouguelia¹ · Slawomir Nowaczyk¹ · K. C. Santosh² · Antanas Verikas¹

Received: 1 September 2016 / Accepted: 18 January 2017 © Springer-Verlag Berlin Heidelberg 2017

Abstract  We propose a new active learning method for classification that handles label noise without relying on multiple oracles (i.e., crowdsourcing). We first propose a strategy that selects (for labeling) instances with a high influence on the learned model. An instance x is said to have a high influence on the model h if training h on x (with label y = h(x)) would result in a model that greatly disagrees with h on the labels of other instances. We then propose another strategy that selects (for labeling) instances that are highly influenced by changes in the learned model. An instance x is said to be highly influenced if training h with a set of instances would result in a committee of models that agree on a common label for x but disagree with h(x). We compare the two strategies and show, on different publicly available datasets, that selecting instances according to the first strategy while eliminating noisy labels according to the second greatly improves accuracy compared to several benchmark methods, even when a significant number of instances are mislabeled.
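To make the two selection criteria concrete, the following is a minimal Python sketch, assuming a scikit-learn-style classifier h already fitted on labeled data (X_train, y_train). The function names, the use of the prediction-change rate as the disagreement measure, and the unanimity test for the committee are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np
from sklearn.base import clone

def influence_on_model(h, X_train, y_train, x, X_pool):
    """First strategy (sketch): influence of instance x on model h,
    measured as the fraction of pool predictions that change after
    training h on x labeled with h's own prediction y = h(x)."""
    x = x.reshape(1, -1)
    y_self = h.predict(x)                      # label x with the current model's prediction
    h_new = clone(h).fit(np.vstack([X_train, x]),
                         np.concatenate([y_train, y_self]))
    return np.mean(h.predict(X_pool) != h_new.predict(X_pool))

def influenced_by_changes(h, X_train, y_train, x, extra_sets):
    """Second strategy (sketch): x is highly influenced if a committee
    of models, each trained on the labeled data plus one extra set of
    (instances, labels), unanimously agrees on a label for x that
    disagrees with h(x)."""
    x = x.reshape(1, -1)
    votes = []
    for X_extra, y_extra in extra_sets:        # hypothetical additional training sets
        h_i = clone(h).fit(np.vstack([X_train, X_extra]),
                           np.concatenate([y_train, y_extra]))
        votes.append(h_i.predict(x)[0])
    unanimous = len(set(votes)) == 1           # committee agrees on a common label
    return unanimous and votes[0] != h.predict(x)[0]
```

Under these assumptions, querying the pool instance that maximizes influence_on_model implements the first selection strategy, while flagging labels that fail the influenced_by_changes test corresponds, in spirit, to the noise-elimination step.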

* K. C. Santosh
  [email protected]

  Mohamed‑Rafik Bouguelia
  [email protected]

  Slawomir Nowaczyk
  [email protected]

  Antanas Verikas
  [email protected]

1  Center for Applied Intelligent Systems Research, Halmstad University, 30118 Halmstad, Sweden

2  Department of Computer Science, The University of South Dakota, 414 E Clark St, Vermillion, SD 57069, USA



Keywords  Active learning · Classification · Label noise · Mislabeling

1 Introduction

In order to learn a classification model, supervised learning algorithms need a training dataset in which each instance is manually labeled. Given a large amount of unlabeled instances, one needs to manually label as many instances as possible. In passive learning, the instances to be labeled are selected at random and presented to a human labeler (the oracle). In this setting, learning methods need a large amount of labeled data to produce a well-performing classifier, yet labeling is costly and time-consuming. Semi-supervised learning methods such as [21] learn from both labeled and unlabeled data, and can therefore reduce the labeling cost to some extent. Active learning methods reduce the labeling cost further by allowing interaction between the learning algorithm and the oracle: unlike passive learning, active learning lets the learner choose which instances are most appropriate for labeling, according to an informativeness measure. The main problem that active learning addresses is defining informativeness in a way that reduces the number of instances to be labeled while improving the classifier's performance. This is an important problem beca