A boosting Self-Training Framework based on Instance Generation with Natural Neighbors for K Nearest Neighbor


Junnan Li 1 & Qingsheng Zhu 1

© Springer Science+Business Media, LLC, part of Springer Nature 2020

Abstract Semi-supervised self-training is one of the successful methodologies of semi-supervised classification. Mislabeling is the most challenging issue in self-training methods, and ensemble learning is one of the common techniques for dealing with it. Specifically, ensemble learning can solve or alleviate mislabeling by constructing an ensemble classifier that improves prediction accuracy in the self-training process. However, most ensemble learning methods may not perform well in self-training because it is difficult to train an effective ensemble classifier with a small number of labeled data. Inspired by successful boosting methods, we introduce a new boosting self-training framework based on instance generation with natural neighbors (BoostSTIG) in this paper. BoostSTIG is compatible with most boosting methods and self-training methods. It can use most boosting methods to solve or alleviate the mislabeling of existing self-training methods by improving the prediction accuracy in the self-training process. In addition, an instance generation method with natural neighbors is proposed to enlarge the initial labeled data in BoostSTIG, which makes boosting methods more suitable for self-training methods. In experiments, we apply the BoostSTIG framework to 2 self-training methods and 4 boosting methods, and then validate BoostSTIG by comparing it with some state-of-the-art technologies on real data sets. Intensive experiments show that BoostSTIG can improve the performance of the tested self-training methods and train an effective k nearest neighbor classifier.

Keywords Semi-supervised learning (SSL) · Semi-supervised classification (SSC) · Self-training · Boosting · Instance generation · Natural neighbors
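For orientation, the generic self-training loop that the abstract refers to can be sketched as follows. This is a minimal illustration only, not BoostSTIG itself: it omits both the boosting step and the instance generation with natural neighbors. The function names (`knn_predict`, `self_train`), the use of the neighbor vote fraction as a confidence score, and the `threshold` and `max_iter` parameters are all assumptions made for this example.

```python
import math


def knn_predict(train, x, k=3):
    """Majority vote among the k labeled points nearest to x.

    `train` is a list of (point, label) pairs. Returns (label, confidence),
    where confidence is the fraction of the k neighbors that voted for the
    winning label (an assumed, simple confidence measure).
    """
    neighbors = sorted(train, key=lambda item: math.dist(item[0], x))[:k]
    votes = {}
    for _, label in neighbors:
        votes[label] = votes.get(label, 0) + 1
    label = max(votes, key=votes.get)
    return label, votes[label] / len(neighbors)


def self_train(labeled, unlabeled, k=3, threshold=0.6, max_iter=10):
    """Plain self-training with a k nearest neighbor base classifier.

    In each round, every unlabeled point is predicted from the current
    labeled set; confident predictions are moved into the labeled set.
    A wrong but confident prediction is exactly the mislabeling problem:
    once added, it influences all later rounds.
    """
    labeled = list(labeled)
    pool = list(unlabeled)
    for _ in range(max_iter):
        confident, rest = [], []
        for x in pool:
            label, conf = knn_predict(labeled, x, k)
            (confident if conf >= threshold else rest).append((x, label))
        if not confident:
            break  # nothing new to add, so stop early
        labeled.extend(confident)
        pool = [x for x, _ in rest]
    return labeled, pool
```

With only a handful of labeled points, the vote in `knn_predict` rests on very few neighbors, which illustrates why the abstract argues for enlarging the initial labeled data before training an ensemble.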

1 Introduction

Classification [1] has attracted great attention from scholars in machine learning and pattern recognition. Because of its importance and great value, it has been applied in text classification, biological medical treatment, spam classification, risk management, digital images, etc. [2–6]. In traditional classification tasks, an effective prediction model is trained on sufficient labeled data. Unfortunately, it is not easy to obtain a large number of labeled samples because labeling is labor-intensive and time-consuming. This was the main motivation that led to the inception of semi-supervised classification (SSC) [7, 8]. SSC can use both labeled and unlabeled data to train a prediction model and complete classification tasks. Two main

* Qingsheng Zhu [email protected]

1 Chongqing Key Laboratory of Software Theory & Technology, College of Computer Science, Chongqing, China

objectives of SSC are transductive and inductive classification [9, 10]. In transductive classification, a trained prediction model is used to predict the label of a subset of un