Regularized and incremental decision trees for data streams



Jean Paul Barddal¹ · Fabrício Enembreck¹

Received: 1 October 2019 / Accepted: 18 June 2020
© Institut Mines-Télécom and Springer Nature Switzerland AG 2020

Abstract

Decision trees are a widely used family of methods for learning predictive models from both batch and streaming data. Despite achieving positive results in a multitude of applications, incremental decision trees continuously grow in terms of nodes as new data becomes available: they eventually split on all available features, often multiple times on the same feature, leading to unnecessary complexity and overfitting. With this behavior, incremental trees lose the ability to generalize well, remain human-understandable, and stay computationally efficient. To tackle these issues, we proposed in a previous study a regularization scheme for Hoeffding decision trees that (i) uses a penalty factor to control the gain obtained by creating a new split node on a feature that has not been used thus far and (ii) uses information from previous splits in the current branch to determine whether the observed gain indeed justifies a new split. In this paper, we extend this analysis by applying the proposed regularization scheme to other types of incremental decision trees and report results in both synthetic and real-world scenarios. The main interest is to verify whether and how the proposed regularization scheme affects the different types of incremental trees. Results show that, in addition to the original Hoeffding Tree, the Adaptive Random Forest also benefits from regularization, whereas McDiarmid Trees and Extremely Fast Decision Trees exhibit declines in accuracy.

Keywords Data stream mining · Classification · Decision tree · Random forest · Regularization
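To make the intuition behind the regularization scheme concrete, the Python sketch below illustrates one way a penalty-adjusted split gain could be computed. It is an illustration only, not the authors' exact formulation: the multiplicative penalty, the function name `regularized_gain`, and the comparison against the minimum gain of earlier splits in the branch are assumptions made for the sake of the example.

```python
# Illustrative sketch of a penalty-adjusted split gain for an incremental tree.
# NOT the authors' exact formulation: the multiplicative penalty and the way
# previous splits in the branch are summarized are assumptions for illustration.

def regularized_gain(raw_gain, feature, used_features, branch_gains, penalty=0.9):
    """Return a penalized gain for `feature`.

    raw_gain      : gain (e.g., information gain) computed from node statistics.
    feature       : candidate feature for the new split.
    used_features : set of features already used by splits in the tree.
    branch_gains  : gains observed at previous splits along the current branch.
    penalty       : factor in (0, 1] that discounts splits on unused features.
    """
    gain = raw_gain
    # (i) discount the gain when the candidate feature has not been used yet,
    #     discouraging the tree from eventually splitting on every feature
    if feature not in used_features:
        gain *= penalty
    # (ii) require the gain to be competitive with gains of earlier splits in
    #      this branch before allowing further growth
    if branch_gains and gain < min(branch_gains):
        return 0.0  # not enough improvement to justify a new split
    return gain


# Example: candidate feature "f3" has not been used anywhere in the tree yet
g = regularized_gain(0.12, "f3", used_features={"f1"}, branch_gains=[0.20, 0.15])
print(g)  # 0.0 -> the penalized gain does not beat earlier splits in the branch
```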

1 Introduction

The growth rates of data acquisition, storage, and processing have drawn the attention of both researchers and practitioners towards efficient analysis and knowledge extraction from these humongous datasets. Such datasets become more pervasive every day as Internet of Things (IoT) devices rapidly gather and produce data sequentially over time, in the form of potentially unbounded data streams. In addition to processing data streams, it is relevant to learn from them. Therefore, a variety of algorithms have been proposed or adapted from batch scenarios to learn and update predictive models as new data arrives. For instance, Bayesian [1–4] and decision tree [5, 6] models are popular approaches for data stream classification.

Jean Paul Barddal
[email protected]

Fabrício Enembreck
[email protected]

1 Graduate Program in Informatics (PPGIa), Pontifícia Universidade Católica do Paraná, Curitiba, Brazil

For instance, Hoeffding [5] and McDiarmid Trees [6] are often regarded as elegant, efficient, and robust approaches that learn decision trees using constant time per instance [5, 7]. Also, both have theoretical guarantees on the convergence between the decision trees learned in streaming