MIDIA: exploring denoising autoencoders for missing data imputation


Qian Ma1 · Wang-Chien Lee2 · Tao-Yang Fu2 · Yu Gu3 · Ge Yu3

Received: 16 August 2019 / Accepted: 17 July 2020 © The Author(s), under exclusive licence to Springer Science+Business Media LLC, part of Springer Nature 2020

Abstract Missing values (MVs) are ubiquitous in real-world datasets, so MV imputation, which aims to recover MVs, is an important and fundamental data preprocessing step that allows data analytics and mining tasks to achieve good performance. A typical approach to imputing MVs is to explore the correlations among the attributes of the data. However, those correlations are usually complex and thus difficult to identify. Accordingly, we develop a new deep learning model called MIssing Data Imputation denoising Autoencoder (MIDIA) that effectively imputes the MVs in a given dataset by exploring non-linear correlations between missing values and non-missing values. Additionally, by considering various data missing patterns, we propose two effective MV imputation approaches based on the proposed MIDIA model, namely MIDIA-Sequential and MIDIA-Batch. MIDIA-Sequential imputes the MVs attribute by attribute, training an independent MIDIA model for each incomplete attribute. By contrast, MIDIA-Batch imputes the MVs in one batch by training a uniform MIDIA model. Finally, we evaluate the proposed approaches experimentally against existing MV imputation algorithms. The experimental results demonstrate that both MIDIA-Sequential and MIDIA-Batch achieve significantly higher imputation accuracy than existing solutions, and that the proposed approaches are capable of handling various data missing patterns and data types. Specifically, MIDIA-Sequential performs better than MIDIA-Batch for data with a monotone missing pattern, while MIDIA-Batch performs better than MIDIA-Sequential for data with a general missing pattern.

Keywords Missing data imputation · Denoising autoencoder · MIDIA · Deep learning
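To make the general idea of imputing with a denoising autoencoder concrete, the following is a minimal sketch and not the MIDIA architecture itself, whose details are not given in the text above. It trains a single one-hidden-layer autoencoder over all attributes, loosely in the spirit of using one model for the whole dataset. The function name `dae_impute`, the hidden size, the learning rate, the corruption rate, and the choice of initialising missing entries to the column mean are all assumptions made for illustration.

```python
import numpy as np

def dae_impute(X, hidden=8, epochs=500, lr=0.05, drop=0.2, seed=0):
    """Fill NaNs in X with the output of a small denoising autoencoder.

    Hypothetical helper for illustration; assumes every column has at
    least one observed value.
    """
    rng = np.random.default_rng(seed)
    observed = ~np.isnan(X)                       # True where a value is present

    # Standardise each column using its observed values only.
    mu = np.nanmean(X, axis=0)
    sigma = np.nanstd(X, axis=0)
    sigma[sigma == 0] = 1.0
    # Missing entries start at 0, i.e. the column mean after standardising.
    Z = np.where(observed, (X - mu) / sigma, 0.0)

    d = Z.shape[1]
    W1 = rng.normal(0.0, 0.1, (d, hidden)); b1 = np.zeros(hidden)
    W2 = rng.normal(0.0, 0.1, (hidden, d)); b2 = np.zeros(d)
    n_obs = observed.sum()

    for _ in range(epochs):
        # Denoising step: randomly zero out a fraction of the inputs.
        keep = rng.random(Z.shape) >= drop
        Zc = Z * keep
        H = np.tanh(Zc @ W1 + b1)                 # encoder
        R = H @ W2 + b2                           # linear decoder

        # Mean squared error, measured on observed entries only.
        g_R = 2.0 * (R - Z) * observed / n_obs
        g_W2 = H.T @ g_R
        g_b2 = g_R.sum(axis=0)
        g_A = (g_R @ W2.T) * (1.0 - H ** 2)       # back through tanh
        g_W1 = Zc.T @ g_A
        g_b1 = g_A.sum(axis=0)

        W1 -= lr * g_W1; b1 -= lr * g_b1
        W2 -= lr * g_W2; b2 -= lr * g_b2

    # Reconstruct from the uncorrupted input and undo the standardisation.
    R = np.tanh(Z @ W1 + b1) @ W2 + b2
    return np.where(observed, X, R * sigma + mu)

# Usage on synthetic, correlated data with about 10% of entries removed.
rng = np.random.default_rng(1)
a = rng.normal(size=200)
X_full = np.column_stack([a,
                          2 * a + 0.1 * rng.normal(size=200),
                          -a + 0.1 * rng.normal(size=200)])
mask = rng.random(X_full.shape) < 0.1
X = X_full.copy()
X[mask] = np.nan
X_imp = dae_impute(X)
```

Observed entries are returned unchanged; only the positions that were NaN are filled with the reconstruction. An attribute-by-attribute variant would instead train one such model per incomplete column, which is the distinction the abstract draws between the sequential and batch approaches.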

Responsible editor: Shuiwang Ji.


Qian Ma [email protected]

Extended author information available on the last page of the article


1 Introduction

Due to various uncontrollable factors, e.g., hardware failure, unconscious malfunction, participants' refusal, etc., missing values (MVs) widely exist in various kinds of real-world datasets, e.g., medical datasets, microarray gene datasets, survey datasets and sensing datasets. For many algorithms employed in data analytics, data mining and machine learning (Gharibshah et al. 2020; Dong et al. 2014), data integrity is a prerequisite because these algorithms cannot handle datasets with MVs. Moreover, the information loss resulting from MVs may degrade the performance of the employed algorithms (Anagnostopoulos and Triantafillou 2014). Therefore, the critical task of missing value imputation (MV imputation), aiming to replace the MVs with some plausible estimations, attracts much research at
