Double random forest



Sunwoo Han¹ · Hyunjoong Kim² · Yung-Seop Lee³

Received: 7 September 2019 / Revised: 4 May 2020 / Accepted: 4 June 2020
© The Author(s), under exclusive licence to Springer Science+Business Media LLC, part of Springer Nature 2020

Abstract

Random forest (RF) is one of the most popular parallel ensemble methods that use decision trees as classifiers. One of the hyper-parameters to choose when fitting RF is nodesize, which determines the size of the individual trees. In this paper, we begin with the observation that for many data sets (34 out of 58), the best RF prediction accuracy is achieved when the trees are grown fully by minimizing the nodesize parameter. This observation suggests that prediction accuracy could be improved further if we could generate trees even bigger than those obtained with the minimum nodesize; in other words, the largest tree grown under the minimum nodesize may still not be large enough for RF to perform at its best. To produce bigger trees than RF does, we propose a new classification ensemble method called double random forest (DRF). Instead of bootstrapping only once at the root node as in RF, the new method applies the bootstrap at every node during tree construction. This, in turn, yields an ensemble of more diverse trees and therefore more accurate predictions. Finally, for data sets where RF does not produce sufficiently large trees, we demonstrate that DRF provides more accurate predictions than RF.

Keywords: Classification · Ensemble · Random forest · Bootstrap · Decision tree
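To make the node-level bootstrap concrete, here is a minimal sketch in Python of the mechanism the abstract describes. It is an illustration under my own assumptions, not the authors' implementation: the helper names, the Gini-based split search, and the toy usage at the end are mine, and the random feature subset that RF and DRF draw at each node is omitted for brevity.

```python
import numpy as np

def gini(y):
    """Gini impurity of an integer-coded label vector."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(X, y):
    """Exhaustive (feature, threshold) search minimizing weighted child Gini."""
    best_feat, best_thr, best_score = None, None, np.inf
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j])[:-1]:   # exclude max so both children are non-empty
            left = X[:, j] <= thr
            score = left.mean() * gini(y[left]) + (~left).mean() * gini(y[~left])
            if score < best_score:
                best_feat, best_thr, best_score = j, thr, score
    return best_feat, best_thr

def grow_node(X, y, node_bootstrap, rng, min_nodesize=1):
    """Grow one node recursively. The single difference between the regimes:
    DRF draws a fresh bootstrap of the node's data before the split search,
    while RF searches on the node data as-is (bootstrapped once at the root)."""
    if len(y) <= min_nodesize or len(np.unique(y)) == 1:
        return {"leaf": True, "pred": np.bincount(y).argmax()}
    if node_bootstrap:                         # DRF-style: bootstrap at every node
        idx = rng.integers(0, len(y), len(y))
        feat, thr = best_split(X[idx], y[idx])
    else:                                      # RF-style: no node-level resampling
        feat, thr = best_split(X, y)
    if feat is None:                           # all features constant -> leaf
        return {"leaf": True, "pred": np.bincount(y).argmax()}
    left = X[:, feat] <= thr                   # route the ORIGINAL node data
    if left.all() or not left.any():           # split degenerate on the real data
        return {"leaf": True, "pred": np.bincount(y).argmax()}
    return {"leaf": False, "feat": feat, "thr": thr,
            "left": grow_node(X[left], y[left], node_bootstrap, rng, min_nodesize),
            "right": grow_node(X[~left], y[~left], node_bootstrap, rng, min_nodesize)}

# Toy usage: grow one DRF-style tree on synthetic data.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
tree = grow_node(X, y, node_bootstrap=True, rng=rng)
```

Note that because the split is selected on a bootstrapped copy but the node's original observations are routed down both branches, every observation survives to the leaves, which is what allows DRF to grow trees larger than RF's.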

Editor: Byron Wallace.

* Hyunjoong Kim, [email protected]
  Sunwoo Han, [email protected]
  Yung-Seop Lee, [email protected]

¹ Vaccine and Infectious Disease Division, Fred Hutchinson Cancer Research Center, Seattle, WA 98006, USA
² Department of Applied Statistics, Yonsei University, Seoul 03722, South Korea
³ Department of Statistics, Dongguk University, Seoul 04620, South Korea




1 Introduction

The ensemble method combines multiple models to achieve better prediction accuracy in classification and regression (Dietterich 2000). Ensemble methods generally perform better than single fitted models (Hansen and Salamon 1990). Because of this good performance, ensemble methods are widely used in the machine learning and statistics communities (Breiman 1996; Freund and Schapire 1996; Bauer and Kohavi 1999; Amaratunga et al. 2008; Wolf et al. 2010). Classification ensemble methods combine several classifiers, typically decision trees. Well-known classification ensemble methodologies include Boosting (Freund and Schapire 1996; Schapire 1990; Freund and Schapire 1997), Bagging (Breiman 1996), Random Forest (Breiman 2001), Gradient Boosting (Mason et al. 1999; Hastie et al. 2009), and XGBoost (Chen and Guestrin 2016).

Random forest (RF) is widely used across many fields because it has several advantages over other classification ensemble methods. RF is fast in both training and prediction.
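As a side note on the nodesize observation from the abstract, the effect of growing trees fully versus restricting their size can be checked with off-the-shelf tooling. The sketch below uses scikit-learn, where min_samples_leaf plays the role of the nodesize parameter; the data set, forest size, and leaf-size grid are my own illustrative choices, not the paper's 58-data-set benchmark.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# Sweep the minimum leaf size: min_samples_leaf=1 grows each tree fully,
# mirroring the minimal-nodesize setting discussed in the abstract.
for leaf in (1, 5, 10, 25):
    rf = RandomForestClassifier(n_estimators=500, min_samples_leaf=leaf,
                                random_state=0, n_jobs=-1)
    acc = cross_val_score(rf, X, y, cv=5).mean()
    print(f"min_samples_leaf={leaf:2d}: mean CV accuracy = {acc:.4f}")
```

With min_samples_leaf=1 the trees are grown as deep as the data allow, which is the regime the paper reports as best on 34 of its 58 benchmark data sets.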