Combination of Active and Random Labeling Strategy in the Non-stationary Data Stream Classification




Abstract. A significant problem when building classifiers based on data streams is access to the correct labels. Most algorithms assume access to this information without any restrictions. Unfortunately, this is not possible in practice, because objects can arrive so quickly that labeling all of them is impossible, or because we have to pay for each correct label (e.g., to a human expert). Hence, methods based on partially labeled data are becoming increasingly popular, including active learning approaches, in which the learning algorithm itself decides which objects are worth labeling in order to improve the predictive model effectively. In this paper, we propose a new active learning method for data stream classifiers. Its quality has been compared with benchmark solutions on a large number of test streams, and the results confirm the usefulness of the proposed method, especially when the budget dedicated to labeling incoming objects is low.

Keywords: Data stream classification · Active learning · Concept drift

1 Introduction

The design of classifiers for streaming data is the subject of intensive research because, for most decision tasks today, data arrive continuously [4]. When building such a system, we must take into account several vital issues, such as limited memory and computing resources, which mean that not all incoming data can be stored and that each object can be analyzed at most once [3]. Another difficulty in the construction of data stream classifiers is the phenomenon called concept drift: while the classification model is being used and trained, the probabilistic characteristics of the classification task may change [5]. Therefore, a classifier dedicated to this type of task must, in addition to respecting the limits on available computing and memory resources, respond correctly to concept drift.

© Springer Nature Switzerland AG 2020. L. Rutkowski et al. (Eds.): ICAISC 2020, LNAI 12415, pp. 576–585, 2020. https://doi.org/10.1007/978-3-030-61401-0_54
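The single-pass constraint and concept drift described in the opening paragraph of the introduction can be illustrated with a small sketch. It is not the method proposed in this paper: the synthetic one-dimensional stream, the nearest-centroid learner, and all names (`make_stream`, `RunningCentroids`, `prequential`, `alpha`) are hypothetical and chosen only for illustration. Each object is seen once, used first for testing and then for training, and then discarded; halfway through the stream the two class means swap, which simulates a sudden drift.

```python
import random


def make_stream(n, drift_at, seed=0):
    """Yield (x, y) pairs one at a time; the two class means swap at drift_at."""
    rng = random.Random(seed)
    for t in range(n):
        y = rng.randint(0, 1)
        sign = 1.0 if t < drift_at else -1.0  # sudden drift: the concept flips
        mean = sign if y == 1 else -sign
        yield rng.gauss(mean, 0.5), y


class RunningCentroids:
    """Nearest-centroid classifier that keeps only one mean per class."""

    def __init__(self, alpha=None):
        # alpha=None: ordinary running mean (old data is never forgotten).
        # alpha in (0, 1]: exponential forgetting, so recent objects dominate.
        self.alpha = alpha
        self.mean = {0: 0.0, 1: 0.0}
        self.count = {0: 0, 1: 0}

    def predict(self, x):
        # Pick the class whose current mean is closest to x.
        return min((0, 1), key=lambda c: abs(x - self.mean[c]))

    def update(self, x, y):
        self.count[y] += 1
        step = self.alpha if self.alpha is not None else 1.0 / self.count[y]
        self.mean[y] += step * (x - self.mean[y])


def prequential(model, stream, drift_at):
    """Test-then-train evaluation; returns accuracy before and after the drift."""
    hits = [0, 0]
    seen = [0, 0]
    for t, (x, y) in enumerate(stream):
        phase = 0 if t < drift_at else 1
        hits[phase] += int(model.predict(x) == y)  # predict first ...
        model.update(x, y)                         # ... then learn; x is not stored
        seen[phase] += 1
    return hits[0] / seen[0], hits[1] / seen[1]


N, DRIFT_AT = 2000, 1000
static_acc = prequential(RunningCentroids(), make_stream(N, DRIFT_AT), DRIFT_AT)
adaptive_acc = prequential(RunningCentroids(alpha=0.05), make_stream(N, DRIFT_AT), DRIFT_AT)
print("no forgetting   (before, after drift):", static_acc)
print("with forgetting (before, after drift):", adaptive_acc)
```

In this toy setting, the learner without forgetting is expected to recover far more slowly after the drift than the one with exponential forgetting, because its class means remain dominated by objects from the old concept. Note that the sketch trains on the label of every object, which is exactly the unrestricted-label assumption questioned in the remainder of this section.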


In this work, we also deal with another critical problem encountered in streaming data analysis, namely access to the correct labels of incoming objects. Many of the methods described in the literature ignore this topic, assuming that labels are always available. They overlook the fact that, on the one hand, even if we could label the incoming objects, they may arrive so quickly that labeling all of them is impossible, or they may arrive around the clock, which strongly hinders labeling for logistical reasons. On the other hand, the cost of labeling should also be taken into consideration. Sometimes this cost is negligible, e.g., in weather forecasting (we get the label with a delay, but the cost is related only to making the observation and entering it into the system). However, for most cases, such as medical diagnostics, lab