An Incremental Algorithm for Repairing Training Sets with Missing Values
Abstract. Real-life datasets that occur in domains such as industrial process control, medical diagnosis, marketing, and risk management often contain missing values. This poses a challenge for many classification and regression algorithms, which require complete training sets. In this paper we present a new approach for “repairing” such incomplete datasets by constructing a sequence of regression models that iteratively replace all missing values. Additionally, our approach uses the target attribute to estimate the values of missing data. The accuracy of our method, Incremental Attribute Regression Imputation (IARI), is compared with the accuracy of several popular and state-of-the-art imputation methods by applying them to five publicly available benchmark datasets. The results demonstrate the superiority of our approach.
Keywords: Missing data · Imputation · Regression · Classification · Random forest
1 Introduction
In industrial processes and many other real-world applications, data points are collected to gain insight into the process and to support important decisions. Understanding these processes and making predictions for them is vital for their optimization. Missing values in the collected data complicate both building predictive models and applying them to fresh data. Unfortunately, missing values are very common and occur in many processes: sensors that collect data from a production line may fail, a physician who examines a patient might skip some tests, and questionnaires used in market surveys often contain unanswered questions. This problem leads to the following questions:
1. How to build high-quality models for classification and regression when some values in the training set are missing?
2. How to apply trained models to records with missing values?
In this paper we address only the first question, leaving the second for further research.
© Springer International Publishing Switzerland 2016. J.P. Carvalho et al. (Eds.): IPMU 2016, Part II, CCIS 611, pp. 175–186, 2016. DOI: 10.1007/978-3-319-40581-0_15
B. van Stein and W. Kowalczyk
Several methods have been developed for tackling this problem; see, e.g., [4,5,11,15,16]. The most common method, imputation, reconstructs the missing values with the help of various estimates such as means, medians, or simple regression models that predict the missing values. In this paper we present a more sophisticated approach, Incremental Attribute Regression Imputation (IARI), which prioritizes all attributes with missing values and then iteratively “repairs” each of them, one by one, using as predictors the values of all attributes that have no missing values or have already been repaired. Additionally, the target variable is also used as a predictor in the repair process. An attribute is repaired by constructing a regression model and applying it to estimate the missing values. We use the Random Forest algorithm [3,6] due to its accuracy, robustness, and versatility: it can be used to model both numeri
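The incremental repair loop described above can be illustrated with a minimal sketch. It is not the authors' implementation, and several details are assumptions made only for illustration: the function names (`incremental_repair`, `knn_regressor`) are hypothetical, attributes are prioritized by fewest missing values because the text here does not state the actual rule, and a small k-nearest-neighbour regressor stands in for Random Forest so that the example needs no external libraries. It also assumes numeric attributes, a fully observed target, and at least one observed value per attribute.

```python
import math


def knn_regressor(train_X, train_y, query_X, k=3):
    """Stand-in regressor (NOT Random Forest): mean target of the k nearest rows."""
    preds = []
    for q in query_X:
        nearest = sorted(zip(train_X, train_y),
                         key=lambda pair: math.dist(pair[0], q))[:k]
        preds.append(sum(y for _, y in nearest) / len(nearest))
    return preds


def incremental_repair(rows, target, fit_predict=knn_regressor):
    """Return a copy of `rows` in which every None has been replaced.

    rows   : list of equal-length lists of floats, None marks a missing value
    target : list of floats, one per row, assumed complete
    """
    n_rows, n_cols = len(rows), len(rows[0])
    repaired = [list(r) for r in rows]  # work on a copy, keep the input intact
    missing = {j: [i for i in range(n_rows) if rows[i][j] is None]
               for j in range(n_cols)}

    # Attributes usable as predictors: complete now, or repaired later on.
    usable = [j for j in range(n_cols) if not missing[j]]

    # ASSUMED priority rule: fewest missing values first.
    todo = sorted((j for j in range(n_cols) if missing[j]),
                  key=lambda j: len(missing[j]))

    for j in todo:
        def features(i):
            # Predictors: usable attributes plus the target variable.
            return [repaired[i][c] for c in usable] + [target[i]]

        gaps = set(missing[j])
        known = [i for i in range(n_rows) if i not in gaps]
        preds = fit_predict([features(i) for i in known],
                            [repaired[i][j] for i in known],
                            [features(i) for i in missing[j]])
        for i, p in zip(missing[j], preds):
            repaired[i][j] = p
        usable.append(j)  # the repaired attribute becomes a predictor

    return repaired


# Toy data: two attributes, each with one missing value.
rows = [[1.0, 2.0], [2.0, None], [3.0, 6.0], [None, 8.0], [5.0, 10.0]]
target = [1.0, 2.0, 3.0, 4.0, 5.0]
repaired = incremental_repair(rows, target)
```

Because the regressor is passed in as `fit_predict`, a Random Forest could be plugged in by wrapping any implementation that trains on the known rows and predicts for the rows with gaps; the loop itself would not change.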