Data Split Strategies for Evolving Predictive Models



Abstract. A conventional textbook prescription for building good predictive models is to split the data into three parts: training set (for model fitting), validation set (for model selection), and test set (for final model assessment). Predictive models can potentially evolve over time as developers improve their performance either by acquiring new data or improving the existing model. The main contribution of this paper is to discuss problems encountered and propose workflows to manage the allocation of newly acquired data into different sets in such dynamic model building and updating scenarios. Specifically we propose three different workflows (parallel dump, serial waterfall, and hybrid) for allocating new data into the existing training, validation, and test splits. Particular emphasis is laid on avoiding the bias due to the repeated use of the existing validation or the test set.

Keywords: Data splits · Model assessment · Predictive models

1 Introduction

A common data mining task is to build a good predictive model which generalizes well on future unseen data. Based on the annotated data collected so far, the goal for a machine learning practitioner is to search for the best predictive model (known as supervised learning) and at the same time to have a reasonably good estimate of the performance (or risk) of the model on future unseen data. It is well known that the performance of the model on the data used to learn the model (the training set) is an overly optimistic estimate of the performance on unseen data. For this reason it is common practice to sequester a portion of the data to assess the model performance and never use it during the actual model building process. When we are in a data-rich situation, a conventional textbook prescription (for example, refer to Chapter 7 in [6]) is to split the data into three parts: training set, validation set, and test set (see Figure 1). The training set is used for model fitting, that is, to estimate the parameters of the model. The validation set is used for model selection, that is, we use the performance of the model on the validation set to select among various competing models (e.g., should we use a linear classifier like logistic regression or a non-linear neural network) or to choose the hyperparameters of the model (e.g., choosing the regularization parameter for logistic regression or the number of nodes in the hidden layer for a neural network). The test set is then used for final model assessment, that is, to estimate the performance of the final chosen model on future unseen data.

Fig. 1. Data splits (training 50%, validation 25%, test 25%) for model fitting, selection, and assessment. The training split is used to estimate the model parameters. The validation split is used to estimate prediction error for model selection. The test split is used to estimate the performance of the final chosen model.
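To make the workflow above concrete, the following is a minimal sketch (not from the paper) of the 50/25/25 split in Fig. 1 using scikit-learn; the synthetic data, the candidate regularization values, and the accuracy metric are illustrative assumptions. The model is fit on the training split, the regularization parameter is chosen on the validation split, and the test split is touched only once, for the final assessment.

```python
# Minimal sketch of the three-way split in Fig. 1 (50% train / 25% validation / 25% test).
# Assumes scikit-learn; the data, hyperparameter grid, and metric are purely illustrative.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# 50% training, then split the remainder evenly into validation and test (25% each).
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.5, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

# Model fitting on the training split; model selection on the validation split.
best_model, best_val_acc = None, -np.inf
for C in [0.01, 0.1, 1.0, 10.0]:  # candidate regularization parameters
    model = LogisticRegression(C=C, max_iter=1000).fit(X_train, y_train)
    val_acc = accuracy_score(y_val, model.predict(X_val))
    if val_acc > best_val_acc:
        best_model, best_val_acc = model, val_acc

# Model assessment: the test split is used exactly once, on the final chosen model.
test_acc = accuracy_score(y_test, best_model.predict(X_test))
print(f"validation accuracy = {best_val_acc:.3f}, test accuracy = {test_acc:.3f}")
```

Keeping the test split out of both fitting and selection is what the later sections build on: once the test set has influenced a modeling decision, its error estimate is no longer unbiased.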