Data Split Strategies for Evolving Predictive Models



Abstract. A conventional textbook prescription for building good predictive models is to split the data into three parts: training set (for model fitting), validation set (for model selection), and test set (for final model assessment). Predictive models can potentially evolve over time as developers improve their performance either by acquiring new data or improving the existing model. The main contribution of this paper is to discuss problems encountered and propose workflows to manage the allocation of newly acquired data into different sets in such dynamic model building and updating scenarios. Specifically we propose three different workflows (parallel dump, serial waterfall, and hybrid) for allocating new data into the existing training, validation, and test splits. Particular emphasis is laid on avoiding the bias due to the repeated use of the existing validation or the test set.

Keywords: Data splits · Model assessment · Predictive models

1 Introduction

A common data mining task is to build a good predictive model which generalizes well on future unseen data. Based on the annotated data collected so far, the goal for a machine learning practitioner is to search for the best predictive model (known as supervised learning) and at the same time to have a reasonably good estimate of the performance (or risk) of the model on future unseen data. It is well known that the performance of the model on the data used to learn the model (the training set) is an overly optimistic estimate of the performance on unseen data. For this reason it is common practice to sequester a portion of the data to assess the model performance and never use it during the actual model building process. When we are in a data-rich situation, a conventional textbook prescription (for example, refer to Chapter 7 in [6]) is to split the data into three parts: training set, validation set, and test set (see Figure 1). The training set is used for model fitting, that is, to estimate the parameters of the model. The validation set is used for model selection, that is, we use the performance of the model on the validation set to select among various competing models (e.g., should we use a linear classifier like logistic regression or a non-linear neural network) or to choose the hyperparameters of the model (e.g., choosing the regularization parameter for logistic regression or the number of nodes in the hidden layer for a neural network). The test set is then used for final model assessment, that is, to estimate the performance of the final chosen model on future unseen data.

Fig. 1. Data splits (training 50%, validation 25%, test 25%) for model fitting, selection, and assessment. The training split is used to estimate the model parameters. The validation split is used to estimate prediction error for model selection. The test split is used to estimate the performance of the final chosen model.
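To make the workflow above concrete, the following is a minimal sketch (not from the paper) of the 50/25/25 split in Fig. 1 using scikit-learn; the synthetic data, the candidate regularization values, and the accuracy metric are illustrative assumptions. The model is fit on the training split, the regularization parameter is chosen on the validation split, and the test split is touched only once, for the final assessment.

```python
# Minimal sketch of the three-way split in Fig. 1 (50% train / 25% validation / 25% test).
# Assumes scikit-learn; the data, hyperparameter grid, and metric are purely illustrative.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# 50% training, then split the remainder evenly into validation and test (25% each).
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.5, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

# Model fitting on the training split; model selection on the validation split.
best_model, best_val_acc = None, -np.inf
for C in [0.01, 0.1, 1.0, 10.0]:  # candidate regularization parameters
    model = LogisticRegression(C=C, max_iter=1000).fit(X_train, y_train)
    val_acc = accuracy_score(y_val, model.predict(X_val))
    if val_acc > best_val_acc:
        best_model, best_val_acc = model, val_acc

# Model assessment: the test split is used exactly once, on the final chosen model.
test_acc = accuracy_score(y_test, best_model.predict(X_test))
print(f"validation accuracy = {best_val_acc:.3f}, test accuracy = {test_acc:.3f}")
```

Keeping the test split out of both fitting and selection is what the later sections build on: once the test set has influenced a modeling decision, its error estimate is no longer unbiased.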