Random Forests

Adele Cutler, D. Richard Cutler, and John R. Stevens

5.1 Introduction

Random Forests were introduced by Leo Breiman [6], who was inspired by earlier work by Amit and Geman [2]. Although not obvious from the description in [6], Random Forests are an extension of Breiman's bagging idea [5] and were developed as a competitor to boosting. Random Forests can be used for either a categorical response variable, referred to in [6] as "classification," or a continuous response, referred to as "regression." Similarly, the predictor variables can be either categorical or continuous. From a computational standpoint, Random Forests are appealing because they

  • naturally handle both regression and (multiclass) classification;
  • are relatively fast to train and to predict;
  • depend only on one or two tuning parameters;
  • have a built-in estimate of generalization error;
  • can be used directly for high-dimensional problems;
  • can easily be implemented in parallel.
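
To make these computational properties concrete, the following is a minimal sketch using scikit-learn's RandomForestClassifier, one widely used open-source implementation (the chapter itself does not prescribe a particular implementation, and the dataset and parameter values here are illustrative assumptions):

```python
# Minimal sketch: Random Forest classification with a built-in
# out-of-bag (OOB) generalization-error estimate. Uses scikit-learn
# as an assumed stand-in implementation.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

# In practice only one or two tuning parameters matter:
# the number of trees and the number of candidate variables per split.
rf = RandomForestClassifier(
    n_estimators=500,     # number of trees in the forest
    max_features="sqrt",  # candidate variables tried at each split
    oob_score=True,       # built-in estimate of generalization error
    n_jobs=-1,            # trees are independent, so training parallelizes
    random_state=0,
)
rf.fit(X, y)

print(f"OOB accuracy estimate: {rf.oob_score_:.3f}")
print(f"Predictions for first five cases: {rf.predict(X[:5])}")
```

The same interface handles regression by substituting RandomForestRegressor, which reflects the first bullet above: the method treats classification and regression within one framework.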

Statistically, Random Forests are appealing because of the additional features they provide, such as

  • measures of variable importance;
  • differential class weighting;
  • missing value imputation;
  • visualization;
  • outlier detection;
  • unsupervised learning.
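
As one illustration of these statistical features, the sketch below computes variable importance scores. The impurity-based feature_importances_ attribute and the permutation_importance helper are scikit-learn's interfaces, offered here as assumed stand-ins for the importance measures discussed later in the chapter:

```python
# Sketch: two flavors of variable importance (assumed scikit-learn
# implementation; the chapter's own measure is permutation-based).
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

data = load_breast_cancer()
rf = RandomForestClassifier(n_estimators=300, oob_score=True, random_state=0)
rf.fit(data.data, data.target)

# Impurity-based importance: total decrease in node impurity per variable.
top5 = sorted(zip(data.feature_names, rf.feature_importances_),
              key=lambda t: -t[1])[:5]
for name, imp in top5:
    print(f"{name:25s} {imp:.3f}")

# Permutation importance: drop in accuracy when a variable is shuffled.
perm = permutation_importance(rf, data.data, data.target,
                              n_repeats=10, random_state=0)
print("Top variable by permutation importance:",
      data.feature_names[perm.importances_mean.argmax()])
```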


This chapter gives an introduction to the Random Forest method for classification and regression, including a brief description of the types of classification and regression trees used in the Random Forests algorithm. The chapter describes how out-of-bag data are used not only to give a fast estimate of generalization error but also to estimate variable importance. A discussion of some important practical issues such as tuning the algorithm and weighting classes to deal with unequal sample sizes is also included. Methods for finding Random Forest proximities and using them to give illuminating plots as well as imputing missing values are presented. Finally, references to extensions of the Random Forest method are given.

5.2 The Random Forest Algorithm

As the name suggests, a Random Forest is a tree-based ensemble with each tree depending on a collection of random variables. More formally, for a p-dimensional random vector $X = (X_1, \ldots, X_p)^T$ representing the real-valued input or predictor variables and a random variable $Y$ representing the real-valued response, we assume an unknown joint distribution $P_{XY}(X, Y)$. The goal is to find a prediction function $f(X)$ for predicting $Y$. The prediction function is determined by a loss function $L(Y, f(X))$ and defined to minimize the expected value of the loss

$$E_{XY}\big(L(Y, f(X))\big) \qquad (5.1)$$

where the subscripts denote expectation with respect to the joint distribution of X and Y.
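
As classical background on how the loss function determines $f$ (stated here for completeness, not quoted from this excerpt): minimizing (5.1) under squared-error loss yields the conditional mean, while minimizing it under zero-one loss yields the Bayes rule.

```latex
% Classical minimizers of E_{XY}(L(Y, f(X))), stated as background.
% Squared-error loss (regression), L(Y, f(X)) = (Y - f(X))^2, gives
\[
  f(x) = E(Y \mid X = x).
\]
% Zero-one loss (classification), L(Y, f(X)) = I(Y \neq f(X)),
% gives the Bayes rule
\[
  f(x) = \arg\max_{y} P(Y = y \mid X = x).
\]
```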