Double random forest



Sunwoo Han¹ · Hyunjoong Kim² · Yung-Seop Lee³

Received: 7 September 2019 / Revised: 4 May 2020 / Accepted: 4 June 2020
© The Author(s), under exclusive licence to Springer Science+Business Media LLC, part of Springer Nature 2020

Abstract

Random forest (RF) is one of the most popular parallel ensemble methods that use decision trees as classifiers. One of the hyper-parameters to choose when fitting RF is nodesize, which determines the size of the individual trees. In this paper, we begin with the observation that for many data sets (34 out of 58), the best RF prediction accuracy is achieved when the trees are grown fully by minimizing the nodesize parameter. This observation suggests that prediction accuracy could be improved further if we could generate trees even bigger than those obtained with the minimum nodesize; in other words, the largest tree grown under the minimum nodesize may still not be large enough for RF to perform at its best. To produce bigger trees than RF does, we propose a new classification ensemble method called double random forest (DRF). Instead of bootstrapping only once at the root node as in RF, the new method applies the bootstrap at every node during tree construction. This, in turn, yields an ensemble of more diverse trees and therefore more accurate predictions. Finally, for data sets where RF does not produce sufficiently large trees, we demonstrate that DRF provides more accurate predictions than RF.

Keywords: Classification · Ensemble · Random forest · Bootstrap · Decision tree
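To make the node-level bootstrap concrete, here is a minimal sketch in Python of the mechanism the abstract describes. It is an illustration under my own assumptions, not the authors' implementation: the helper names, the Gini-based split search, and the toy usage at the end are mine, and the random feature subset that RF and DRF draw at each node is omitted for brevity.

```python
import numpy as np

def gini(y):
    """Gini impurity of an integer-coded label vector."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(X, y):
    """Exhaustive (feature, threshold) search minimizing weighted child Gini."""
    best_feat, best_thr, best_score = None, None, np.inf
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j])[:-1]:   # exclude max so both children are non-empty
            left = X[:, j] <= thr
            score = left.mean() * gini(y[left]) + (~left).mean() * gini(y[~left])
            if score < best_score:
                best_feat, best_thr, best_score = j, thr, score
    return best_feat, best_thr

def grow_node(X, y, node_bootstrap, rng, min_nodesize=1):
    """Grow one node recursively. The single difference between the regimes:
    DRF draws a fresh bootstrap of the node's data before the split search,
    while RF searches on the node data as-is (bootstrapped once at the root)."""
    if len(y) <= min_nodesize or len(np.unique(y)) == 1:
        return {"leaf": True, "pred": np.bincount(y).argmax()}
    if node_bootstrap:                         # DRF-style: bootstrap at every node
        idx = rng.integers(0, len(y), len(y))
        feat, thr = best_split(X[idx], y[idx])
    else:                                      # RF-style: no node-level resampling
        feat, thr = best_split(X, y)
    if feat is None:                           # all features constant -> leaf
        return {"leaf": True, "pred": np.bincount(y).argmax()}
    left = X[:, feat] <= thr                   # route the ORIGINAL node data
    if left.all() or not left.any():           # split degenerate on the real data
        return {"leaf": True, "pred": np.bincount(y).argmax()}
    return {"leaf": False, "feat": feat, "thr": thr,
            "left": grow_node(X[left], y[left], node_bootstrap, rng, min_nodesize),
            "right": grow_node(X[~left], y[~left], node_bootstrap, rng, min_nodesize)}

# Toy usage: grow one DRF-style tree on synthetic data.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
tree = grow_node(X, y, node_bootstrap=True, rng=rng)
```

Note that because the split is selected on a bootstrapped copy but the node's original observations are routed down both branches, every observation survives to the leaves, which is what allows DRF to grow trees larger than RF's.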

Editor: Byron Wallace.

* Hyunjoong Kim, [email protected]
  Sunwoo Han, [email protected]
  Yung-Seop Lee, [email protected]

¹ Vaccine and Infectious Disease Division, Fred Hutchinson Cancer Research Center, Seattle, WA 98006, USA
² Department of Applied Statistics, Yonsei University, Seoul 03722, South Korea
³ Department of Statistics, Dongguk University, Seoul 04620, South Korea




1 Introduction

The ensemble method combines multiple models to achieve better prediction accuracy in classification and regression (Dietterich 2000). Ensemble methods generally perform better than single fitted models (Hansen and Salamon 1990). Because of this good performance, ensemble methods are widely used in the machine learning and statistics communities (Breiman 1996; Freund and Schapire 1996; Bauer and Kohavi 1999; Amaratunga et al. 2008; Wolf et al. 2010). Classification ensemble methods combine several classifiers, typically decision trees. Well-known classification ensemble methodologies include Boosting (Freund and Schapire 1996; Schapire 1990; Freund and Schapire 1997), Bagging (Breiman 1996), Random Forest (Breiman 2001), Gradient Boosting (Mason et al. 1999; Hastie et al. 2009), and XGBoost (Chen and Guestrin 2016).

Random forest (RF) is widely used across many fields because it has several advantages over other classification ensemble methods. RF is fast in both training and prediction.
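As a side note on the nodesize observation from the abstract, the effect of growing trees fully versus restricting their size can be checked with off-the-shelf tooling. The sketch below uses scikit-learn, where min_samples_leaf plays the role of the nodesize parameter; the data set, forest size, and leaf-size grid are my own illustrative choices, not the paper's 58-data-set benchmark.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# Sweep the minimum leaf size: min_samples_leaf=1 grows each tree fully,
# mirroring the minimal-nodesize setting discussed in the abstract.
for leaf in (1, 5, 10, 25):
    rf = RandomForestClassifier(n_estimators=500, min_samples_leaf=leaf,
                                random_state=0, n_jobs=-1)
    acc = cross_val_score(rf, X, y, cv=5).mean()
    print(f"min_samples_leaf={leaf:2d}: mean CV accuracy = {acc:.4f}")
```

With min_samples_leaf=1 the trees are grown as deep as the data allow, which is the regime the paper reports as best on 34 of its 58 benchmark data sets.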