An Incremental Algorithm for Repairing Training Sets with Missing Values

Abstract. Real-life datasets that occur in domains such as industrial process control, medical diagnosis, marketing, and risk management often contain missing values. This poses a challenge for many classification and regression algorithms, which require complete training sets. In this paper we present a new approach for “repairing” such incomplete datasets by constructing a sequence of regression models that iteratively replace all missing values. Additionally, our approach uses the target attribute to estimate the values of missing data. The accuracy of our method, Incremental Attribute Regression Imputation (IARI), is compared with the accuracy of several popular and state-of-the-art imputation methods by applying them to five publicly available benchmark datasets. The results demonstrate the superiority of our approach.

Keywords: Missing data · Imputation · Regression · Classification · Random forest

1 Introduction

In industrial processes and many other real-world applications, data points are collected to gain insight into the process and to make important decisions. Understanding and making predictions for these processes are vital for their optimization. Missing values in the collected data cause additional problems in building predictive models and applying them to fresh data. Unfortunately, missing values are very common and occur in many processes: sensors that collect data from a production line may fail, a physician who examines a patient might skip some tests, and questionnaires used in market surveys often contain unanswered questions. This problem leads to the following questions:

1. How to build high-quality models for classification and regression when some values in the training set are missing?
2. How to apply trained models to records with missing values?

In this paper we address only the first question, leaving the answers to the second one for further research.

© Springer International Publishing Switzerland 2016. J.P. Carvalho et al. (Eds.): IPMU 2016, Part II, CCIS 611, pp. 175–186, 2016. DOI: 10.1007/978-3-319-40581-0_15

B. van Stein and W. Kowalczyk

Several methods have been developed for tackling this problem; see, e.g., [4,5,11,15,16]. The most common method, imputation, reconstructs the missing values with the help of various estimates such as means, medians, or simple regression models that predict the missing values. In this paper we present a more sophisticated approach, Incremental Attribute Regression Imputation (IARI), which prioritizes all attributes with missing values and then iteratively “repairs” each of them, one by one, using the values of all attributes that have no missing values or are already repaired as predictors. Additionally, the target variable is also used as a predictor in the repair process. Repairing an attribute is achieved by constructing a regression model and applying it to estimate the missing values. We use the Random Forest algorithm [3,6] due to its accuracy, robustness, and versatility: it can be used to model both numeri
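The incremental repair loop described in this paragraph can be illustrated with a minimal sketch. It is not the authors' implementation. Two things in it are assumptions rather than details taken from the text: the repair order (attributes with the fewest missing values first, since the prioritization criterion is not stated here) and the regressor, which is a simple k-nearest-neighbour mean standing in for Random Forest so that the example needs no external libraries. The names `incremental_repair` and `knn_fit` are hypothetical.

```python
from typing import Callable, List, Optional

Predict = Callable[[List[float]], float]
Fit = Callable[[List[List[float]], List[float]], Predict]


def knn_fit(X: List[List[float]], y: List[float], k: int = 3) -> Predict:
    """Stand-in regressor (mean of the k nearest neighbours).

    ASSUMPTION: the paper uses Random Forest; this is only a
    dependency-free placeholder with the same fit/predict role.
    """
    def predict(x: List[float]) -> float:
        # Rank training rows by squared Euclidean distance to x.
        order = sorted(
            range(len(X)),
            key=lambda i: sum((a - b) ** 2 for a, b in zip(X[i], x)),
        )
        nearest = order[:k]
        return sum(y[i] for i in nearest) / len(nearest)
    return predict


def incremental_repair(
    rows: List[List[Optional[float]]],
    target: List[float],
    fit: Fit = knn_fit,
) -> List[List[float]]:
    """Fill missing values (None) one attribute at a time.

    Predictors for each attribute are all attributes that are complete
    or already repaired, plus the target variable.
    """
    data = [list(r) for r in rows]  # work on a copy
    n_attrs = len(data[0])
    n_missing = [sum(r[j] is None for r in data) for j in range(n_attrs)]
    usable = [j for j in range(n_attrs) if n_missing[j] == 0]

    # ASSUMPTION: repair attributes with the fewest missing values first.
    queue = sorted(
        (j for j in range(n_attrs) if n_missing[j] > 0),
        key=lambda j: n_missing[j],
    )

    for j in queue:
        cols = list(usable)  # predictors available at this step
        known = [i for i, r in enumerate(data) if r[j] is not None]
        unknown = [i for i, r in enumerate(data) if r[j] is None]
        if not known:
            raise ValueError(f"attribute {j} has no observed values")

        def features(i: int) -> List[float]:
            return [data[i][c] for c in cols] + [target[i]]

        model = fit([features(i) for i in known], [data[i][j] for i in known])
        for i in unknown:
            data[i][j] = model(features(i))
        usable.append(j)  # the repaired attribute becomes a predictor

    return data


if __name__ == "__main__":
    rows = [[1.0, 2.0], [2.0, 4.0], [3.0, None], [4.0, 8.0]]
    target = [1.0, 2.0, 3.0, 4.0]
    print(incremental_repair(rows, target))
```

Because the regressor is passed in as a `fit` function, a Random Forest wrapper can be substituted for `knn_fit` without changing the loop.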
