Confidence in predictions from random tree ensembles


Siddhartha Bhattacharyya

Received: 13 March 2012 / Revised: 27 August 2012 / Accepted: 4 December 2012 / Published online: 9 January 2013 © Springer-Verlag London 2013

Abstract Obtaining an indication of confidence of predictions is desirable for many data mining applications. Predictions complemented with confidence levels can inform on the certainty or extent of reliability that may be associated with the prediction. This can be useful in varied application contexts where model outputs form the basis for potentially costly decisions, and in general across risk sensitive applications. The conformal prediction framework presents a novel approach for obtaining valid confidence measures associated with predictions from machine learning algorithms. Confidence levels are obtained from the underlying algorithm, using a non-conformity measure which indicates how ‘atypical’ a given example set is. The non-conformity measure is a key to determining the usefulness and efficiency of the approach. This paper considers inductive conformal prediction in the context of random tree ensembles like random forests, which have been noted to perform favorably across problems. Focusing on classification tasks, and considering realistic data contexts including class imbalance, we develop non-conformity measures for assessing the confidence of predicted class labels from random forests. We examine the performance of these measures on multiple data sets. Results demonstrate the usefulness and validity of the measures, their relative differences, and highlight the effectiveness of conformal prediction random forests for obtaining predictions with associated confidence.

Keywords Prediction confidence · Random forests · Conformal prediction · Classification · Data mining

S. Bhattacharyya, Information and Decision Sciences, College of Business Administration, University of Illinois, Chicago, IL, USA

1 Introduction

Obtaining an indication of confidence of predictions is desirable for many data mining applications. Predictions complemented with confidence levels can inform on the certainty or extent of reliability that may be associated with the prediction. This can be useful, for example, where model outputs form the basis for potentially costly decisions, where decision makers use predictions in determining actions that involve allocation of limited resources, and in general for risk sensitive applications. Here, one may focus on high-confidence predictions or seek alternate strategies to deal with lower confidence cases. This can be beneficial in various applications, ranging from direct marketing, customer attrition and retention efforts, fraud prediction in credit card, insurance, healthcare, etc., network intrusion, to churn or bankruptcy predictions and medical diagnosis. The focus in this paper is on confidence values for classification problems such as these, where the dependent variable specifies a class label. In many of the af