Versatile Decision Trees for Learning Over Multiple Contexts
1 Intelligent System Laboratory, Computer Science, University of Bristol, Bristol, UK
{ra12404,meelis.kull,peter.flach}@bristol.ac.uk
2 King Abdulaziz University, Jeddah, Saudi Arabia
3 Informatics Center, Universidade Federal de Pernambuco, Recife, Brazil
rbcp@cin.ufpe.br
Abstract. Discriminative models for classification assume that training and deployment data are drawn from the same distribution. The performance of these models can vary significantly when they are learned and deployed in different contexts with different data distributions. In the literature, this phenomenon is called dataset shift. In this paper, we address several important issues in the dataset shift problem. First, how can we automatically detect that there is a significant difference between training and deployment data, so that we can take action or adjust the model appropriately? Secondly, different shifts can occur in real applications (e.g., linear and non-linear), which require diverse solutions. Thirdly, how should we combine the model learned on the training data with other models to achieve better performance? This work offers two main contributions towards these issues. We propose a Versatile Model that is rich enough to handle different kinds of shift without making strong assumptions such as linearity, and furthermore does not require labelled data to identify the shift at deployment time. Empirical results on both synthetic and real dataset shifts show strong performance gains achieved by the proposed model.

Keywords: Versatile model · Decision Trees · Dataset shift · Percentile · Kolmogorov-Smirnov test

1 Introduction
Supervised machine learning is typically concerned with learning a model using training data and applying this model to new test data. An implicit assumption made for successfully deploying a model is that both training and test data follow the same distribution. However, the distribution of the attributes can change, especially when the training data is gathered in one context, but the model is deployed in a different context (e.g., the training data is collected in one country but the predictions are required for another country). The presence of such dataset shifts can harm the performance of a learned model. Different
kinds of dataset shift have been investigated in the literature [10]. In this work we focus on shifts in continuous attributes caused by hidden transformations from one context to another. For instance, a diagnostic test may have different resolutions when produced by different laboratories, or the average temperature may change from city to city. In such cases, the distribution of one or more of the covariates in X changes. This problem is referred to as covariate observation shift [7]. We address this problem in two steps. In the first step, we build Decision Trees (DTs)
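The shift-detection idea raised above can be illustrated with a minimal sketch: a two-sample Kolmogorov-Smirnov test comparing one continuous attribute between the training and deployment samples, which requires no deployment labels. The attribute values, the hidden linear transformation and the 0.05 significance level below are illustrative assumptions and not details taken from the paper.

# Minimal sketch (illustrative, not the authors' implementation): flag a shift
# in a single continuous attribute with a two-sample Kolmogorov-Smirnov test.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)

# Training-context attribute, e.g. temperatures observed in one city (Celsius).
x_train = rng.normal(loc=15.0, scale=5.0, size=1000)

# Deployment-context attribute: the same quantity after a hidden linear
# transformation (change of scale and offset), i.e. a covariate observation shift.
x_deploy = 1.8 * rng.normal(loc=15.0, scale=5.0, size=1000) + 32.0

statistic, p_value = ks_2samp(x_train, x_deploy)
if p_value < 0.05:
    print(f"Shift detected: KS statistic={statistic:.3f}, p-value={p_value:.3g}")
else:
    print("No significant shift detected for this attribute")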