Regression Tree Models
Decision trees are models that split data at strategic points, dividing it into groups with high probabilities of one outcome or another. They are especially effective with categorical outcomes, but can also be applied to continuous data, such as the time series we have been considering. Decision trees consist of nodes, which are splits in the data defined by a cutoff on a particular independent variable, and leaves, which are the outcomes. For categorical data, the outcome is a class. For continuous data, the outcome is a continuous number, usually some average measure of the dependent variable. Witten and Frank [1] describe decision trees used for numeric prediction as regression trees, following the statistical use of the term regression for computing numeric quantities from averaged numeric values. Regression equations can be combined with regression trees, as in the M5P model we will present below. A more basic model is demonstrated here with R, used to predict MSCIchina from all of the candidate independent variables in our dataset.
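To make the idea of a node (a cutoff on one independent variable) and a leaf (an average of the dependent variable) concrete, the following is a minimal Python sketch of a single regression-tree split. It is not the R procedure used in this chapter; the function names and the toy data are made up for illustration, and the split criterion (minimizing the summed squared error of the two groups) is the usual CART-style choice, assumed here rather than taken from the text.

```python
from statistics import mean


def sse(values):
    """Sum of squared errors around the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    m = mean(values)
    return sum((v - m) ** 2 for v in values)


def best_split(xs, ys):
    """Return (cutoff, left_mean, right_mean) minimizing total SSE for one predictor."""
    pairs = sorted(zip(xs, ys))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no cutoff can separate identical x values
        cutoff = (pairs[i - 1][0] + pairs[i][0]) / 2  # midpoint between neighbors
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        total = sse(left) + sse(right)
        if best is None or total < best[0]:
            best = (total, cutoff, mean(left), mean(right))
    if best is None:
        raise ValueError("need at least two distinct x values")
    _, cutoff, left_mean, right_mean = best
    return cutoff, left_mean, right_mean


def predict(x, cutoff, left_mean, right_mean):
    """A leaf's prediction is the average outcome of its group."""
    return left_mean if x < cutoff else right_mean


# Hypothetical toy data (not from the chapter's dataset)
xs = [1, 2, 3, 10, 11, 12]
ys = [5.0, 6.0, 7.0, 20.0, 21.0, 22.0]
cutoff, left_mean, right_mean = best_split(xs, ys)
print(cutoff, left_mean, right_mean)  # 6.5 6.0 21.0
```

A full regression tree repeats this search over every candidate variable and then recursively within each resulting group, which is how multi-row rule tables of the kind discussed below arise.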
5.1 R Regression Trees
The variables in the full data set are {Time, S&P500, NYSE, Eurostoxx, Brent, Gold}. Opening R and loading the dataset MultiregCSV.csv as we did in Chap. 4, executing, and selecting the “Model” tab, we obtain the screen in Fig. 5.1. This screen informs us that R is applying a model similar to CART or ID3/C4, widely known decision tree algorithms. Executing this in R yields the output in Fig. 5.2. The decision tree from the Fig. 5.2 output is displayed in Table 5.1. The decision tree model in R selected 70 % of the input data to build its model, following the conventional practice of withholding a portion of the input data to test (or refine) the model. Thus it used 128 of the 184 available observations.
© Springer Science+Business Media Singapore 2017. D.L. Olson and D. Wu, Predictive Data Mining Models, Computational Risk Management, DOI 10.1007/978-981-10-2543-3_5
Fig. 5.1 R regression tree screen
The input data values for MSCIchina ranged from 13.72 to 102.98 (in late 2007), with 2016 monthly values in the 51–57 range. For forecasting, the first three rules don’t apply (they cover the range January 2001 through October 2006). The last four rules thus rely on estimates of Eurostoxx (in the 3000s for the past year or so) and NYSE (which recently has been above and below the cutoff of 10,781). Thus this model realistically forecasts MSCIchina to be in the 60s, reasonable enough given its values in the past few years.
We can also run this model with the trimmed data (not including Eurostoxx). Table 5.2 shows the R output, again based on 128 observations. The only rules here that apply to forecasting are the fourth row, where NYSE < 8047.2 (which hasn’t happened since August 2012), and the last row, which forecasts an MSCIchina value of 64.04. This is obviously far more precise than is realistic, but falls within the range of the model using a
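The 70 % training partition described in this section can be checked with a short Python sketch. It only illustrates the arithmetic behind "128 of the 184 available observations". The function name, the seed, and the rule of truncating 0.70 × 184 = 128.8 down to 128 are assumptions, since the text does not state how R draws its sample.

```python
import math
import random


def partition(n_rows, train_share=0.70, seed=42):
    """Randomly split row indices into a training set and a holdout set."""
    n_train = math.floor(train_share * n_rows)  # assumed rounding rule
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # seed is arbitrary, for repeatability
    return sorted(indices[:n_train]), sorted(indices[n_train:])


train_idx, holdout_idx = partition(184)
print(len(train_idx), len(holdout_idx))  # 128 56
```

The 56 withheld rows are the ones available to test (or refine) the fitted tree.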