Ensemble with estimation: seeking for optimization in class noisy data

PDF / 3,811,022 Bytes
18 Pages / 595.276 x 790.866 pts Page_size
58 Downloads / 233 Views

ORIGINAL ARTICLE

Ensemble with estimation: seeking for optimization in class noisy data Ruifeng Xu1 · Zhiyuan Wen1 · Lin Gui2 · Qin Lu4 · Binyang Li3 · Xizhao Wang5 Received: 26 November 2018 / Accepted: 22 May 2019 © The Author(s) 2019

Abstract Class noise, as know as the mislabeled data in training set, can lead to poor accuracy in classification no matter what machine learning methods are used. A reasonable estimation of class noise has a significant impact on the performance of learning methods. However, the error in existing estimation is inevitable theoretically and infer the performance of optimal classifier trained on noisy data. Instead of seeking a single optimal classifier on noisy data, in this work, we use a set of weak classifiers, which are caused by negative impacts of noisy data, to learn an ensemble strong classifier which is based on the training error and estimation of class noise. By this strategy, the proposed ensemble with estimation method overcomes the gap between the estimation and true distribution of class noise. Our proposed method does not require any a priori knowledge about class noises. We prove that the optimal ensemble classifier on the noisy distribution can approximate the optimal classifier on the clean distribution when the training set grows. Comparisons with existing algorithms show that our methods outperform state-of-the-art approaches on a large number of benchmark datasets in different domains. Both the theoretical analysis and the experimental result reveal that our method can improve the performance, works well on clean data and is robust on the algorithm parameter. Keywords Class Noise · Ensemble Learning · Machine Learning

1 Introduction Typical machine learning method uses a classifier learned from a labeled dataset (i.e., the training data) to predict the class labels of new samples (i.e., the testing data). In most of classification applications, labels of the training data are assumed correct. However, real-world datasets often contain noise which may occur either in the features of the data, defined as the attribute noise, or in the labels of the data, defined as the class noise. Many studies have focused on handling attribute noise since it is quite common in machine learning and data

* Lin Gui [email protected] 1

Harbin Institute of Technology Shenzhen, Shenzhen, China

2

Department of Computer Science, University of Warwick, Coventry, UK

3

University of International Relations, Beijing, China

4

Department of Computing, The Hong Kong Polytechnic University, Hung Hom, Hong Kong

5

Big Data Institute, ShenZhen University, Shenzhen, China

mining tasks. However, researchers, such as [1, 2], have indicated that class noise can be potentially more detrimental than attribute noise. The study on class noise problem has an essential impact on classification performance improvement [1]. We must point out that class noise is unavoidable in many real world applications such as disease prediction in medical applications [3], food labeling for th

Data Loading...

Ensemble with estimation: seeking for optimization in class noisy data

Recommend Documents

Learning with Noisy Class Labels for Instance Segmentation

Dependency Parsing with Noisy Multi-annotation Data

Iris Segmentation Using Feature Channel Optimization for Noisy Environments

Unsupervised Feature Analysis with Class Margin Optimization

Efficient Multi-class Fetal Brain Segmentation in High Resolution MRI Reconstructions with Noisy Labels

Ensemble Learning for Heterogeneous Missing Data Imputation

MeLiF+: Optimization of Filter Ensemble Algorithm with Parallel Computing

Hyperparameters tuning of ensemble model for software effort estimation

Optimization Software Class Libraries

Ensemble Distilling Pretrained Language Models for Machine Translation Quality Estimation

Developing Ensemble Methods for Detecting Anomalies in Water Level Data

Estimation of linear functionals from indirect noisy data without knowledge of the noise level