New Options for Hoeffding Trees
Abstract. Hoeffding trees are state-of-the-art for processing high-speed data streams. Their ingenuity stems from updating sufficient statistics, only addressing growth when decisions can be made that are guaranteed to be almost identical to those that would be made by conventional batch learning methods. Despite this guarantee, decisions are still subject to limited lookahead and stability issues. In this paper we explore Hoeffding Option Trees, a regular Hoeffding tree containing additional option nodes that allow several tests to be applied, leading to multiple Hoeffding trees as separate paths. We show how to control tree growth in order to generate a mixture of paths, and empirically determine a reasonable number of paths. We then empirically evaluate a spectrum of Hoeffding tree variations: single trees, option trees and bagged trees. Finally, we investigate pruning. We show that on some datasets a pruned option tree can be smaller and more accurate than a single tree.
1 Introduction
When training a model on a data stream it is important to be conservative with memory and to make a single scan of the data as quickly as possible. Hoeffding trees [6] achieve this by accumulating sufficient statistics from examples in a node until they can be used to make a sensible split decision, one that is guaranteed to be almost as reliable as the decision that would be made by an algorithm able to see all of the data. The sufficient statistics turn out to be beneficial for both tree growth and prediction, as they can be used to build Naive Bayes models at the leaves of the tree that are more accurate than majority-class estimates [7]. There remain situations where Naive Bayes leaves are not superior to majority class, and [8] shows that adaptively deciding per leaf when their use is beneficial can further enhance classification performance.
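The guarantee rests on the Hoeffding bound of [6]: after n independent observations of a real-valued quantity with range R, the true mean lies within \epsilon = \sqrt{R^2 \ln(1/\delta) / (2n)} of the observed mean with probability 1 - \delta, so a leaf is split on the highest-scoring attribute once its score beats the runner-up's by more than \epsilon. The following minimal sketch (hypothetical function names, not the implementation evaluated here) illustrates the test; in practice [6] also splits when \epsilon falls below a small tie-breaking threshold.

import math

def hoeffding_bound(value_range, delta, n):
    # Epsilon such that the true mean of a variable with the given range,
    # observed n times, lies within epsilon of the sample mean with
    # probability 1 - delta.
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))

def should_split(gain_best, gain_second, n, num_classes, delta=1e-7):
    # For information gain the range R is log2(number of classes).
    epsilon = hoeffding_bound(math.log2(num_classes), delta, n)
    return gain_best - gain_second > epsilon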
Given this sequence of improvements to the basic algorithm, is it possible to improve predictive performance further? Contemporary machine learning would suggest that the next logical step is to employ ensembles of Hoeffding trees using a framework such as bagging or boosting. In part, ensemble methods seek to overcome problems inherent in all greedy tree learners. Tree learners have limited lookahead ability (the best-scoring single test is chosen based only on the children that test would generate) and they are not stable. Stability [2] relates to the selection of a particular attribute split when the scores of other attribute splits are close. The examples processed just prior to the split decision could have an undue influence over it, and it may not be the best decision in the longer term. Breiman [4] demonstrates just how unstable this influence can be, especially when marginal decisions are taken near the root of the tree. Even though decisions are made with more data in a Hoeffding tree, they are still prone to these issues.
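To make the option-node idea from the abstract concrete, the sketch below (hypothetical types, not the authors' code) contrasts a standard split node, which routes an example down exactly one branch, with an option node, which applies several alternative tests so that a single example reaches multiple leaves, i.e. multiple paths, whose predictions can then be combined.

from dataclasses import dataclass, field
from typing import Callable, Dict, List, Union

@dataclass
class Leaf:
    # Per-class counts; prediction could use majority class or a
    # Naive Bayes model built from the sufficient statistics.
    class_counts: Dict[str, int] = field(default_factory=dict)

@dataclass
class SplitNode:
    test: Callable[[dict], int]   # maps an example to one child index
    children: List["Node"]

@dataclass
class OptionNode:
    options: List["Node"]         # several alternative subtrees

Node = Union[Leaf, SplitNode, OptionNode]

def reachable_leaves(node, x):
    # A split node sends x down a single branch; an option node sends
    # it down every option, so x can reach several leaves (paths).
    if isinstance(node, Leaf):
        return [node]
    if isinstance(node, SplitNode):
        return reachable_leaves(node.children[node.test(x)], x)
    return [leaf for opt in node.options for leaf in reachable_leaves(opt, x)]

Combining the per-leaf predictions, for example by averaging class distributions across the reached leaves, yields the mixture of paths referred to in the abstract.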