Imputation and low-rank estimation with Missing Not At Random data

Aude Sportisse1,2 · Claire Boyer1,3 · Julie Josse2,4
Received: 27 December 2018 / Accepted: 7 July 2020
© Springer Science+Business Media, LLC, part of Springer Nature 2020

Abstract Missing values challenge data analysis because many supervised and unsupervised learning methods cannot be applied directly to incomplete data. Matrix completion methods based on low-rank assumptions are a powerful solution for dealing with missing values. However, existing methods do not consider the case of informative missing values, which are widely encountered in practice. This paper proposes matrix completion methods to recover Missing Not At Random (MNAR) data. Our first contribution is a model-based estimation strategy that models the distribution of the missing-data mechanism. An EM algorithm is then implemented, involving a Fast Iterative Soft-Thresholding Algorithm (FISTA). Our second contribution is a computationally efficient surrogate estimation that implicitly takes into account the joint distribution of the data and the missing mechanism: the data matrix is concatenated with the mask coding for the missing values, and a low-rank structure for exponential family is assumed on this new matrix in order to encode links between variables and missing mechanisms. This methodology, which has the great advantage of handling different missing value mechanisms, is robust to model specification errors. The performance of our methods is assessed on real data collected from a trauma registry (TraumaBase®) containing clinical information about over twenty thousand severely traumatized patients in France. The aim is to predict whether doctors should administer tranexamic acid, which would limit excessive bleeding, to patients with traumatic brain injury.

Keywords Informative missing values · Denoising · Matrix completion · Accelerated proximal gradient method · EM algorithm · Nuclear norm penalty
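To make the second idea in the abstract concrete, here is a minimal NumPy sketch of concatenating a data matrix with its missingness mask and fitting a nuclear-norm-penalized low-rank estimate to the result. It is an illustration under strong simplifying assumptions, not the authors' implementation: the mask column is fitted with a squared loss rather than the exponential-family model mentioned above, plain proximal iterations are used instead of FISTA, and the simulated self-masked mechanism, the function names `svd_soft_threshold` and `soft_impute`, and the value `lam=1.0` are all hypothetical choices made for the example.

```python
import numpy as np

def svd_soft_threshold(Z, lam):
    """Proximal operator of lam * nuclear norm: shrink singular values by lam."""
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    s_shrunk = np.maximum(s - lam, 0.0)
    return (U * s_shrunk) @ Vt

def soft_impute(X, observed, lam, n_iter=200):
    """Alternate between filling unobserved cells and low-rank shrinkage."""
    theta = np.zeros_like(X, dtype=float)
    X_obs = np.where(observed, X, 0.0)  # NaNs at unobserved cells are discarded
    for _ in range(n_iter):
        filled = np.where(observed, X_obs, theta)
        theta = svd_soft_threshold(filled, lam)
    return theta

rng = np.random.default_rng(0)
n, p, r = 100, 8, 2
Y = rng.normal(size=(n, r)) @ rng.normal(size=(r, p))  # rank-2 signal

# Hypothetical self-masked MNAR mechanism: large values of column 0
# are more likely to be missing.
prob_missing = 1.0 / (1.0 + np.exp(-2.0 * Y[:, 0]))
missing = np.zeros((n, p), dtype=bool)
missing[:, 0] = rng.random(n) < prob_missing

Y_incomplete = np.where(missing, np.nan, Y)

# Concatenate the data with the mask of the MNAR column; the mask itself
# is always fully observed.
mask_col = missing[:, [0]].astype(float)
augmented = np.hstack([Y_incomplete, mask_col])
observed = np.hstack([~missing, np.ones((n, 1), dtype=bool)])

theta = soft_impute(augmented, observed, lam=1.0)

# Keep observed entries, use the low-rank estimate for the missing ones.
Y_imputed = np.where(missing, theta[:, :p], Y)
```

Because the mask column sits in the same matrix as the data, the low-rank fit can use the missingness pattern as an extra source of information when estimating the missing entries of column 0. In practice `lam` would have to be tuned, for example by cross-validation on artificially removed entries.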

1 Introduction The problem of missing data is ubiquitous in the practice of data analysis. Main approaches for handling missing data include imputation methods and the use of the Expectation-Maximization (EM) algorithm (Dempster et al. 1977), which yields maximum likelihood estimators in various incomplete-data problems (Little and Rubin 2014). The theoretical guarantees of these methods ensuring the correct prediction of missing values or the correct estimation of some parameters o

Julie Josse was supported by the Data Analytics and Models for Insurance chair and Aude Sportisse was funded by a PEPS project (Projet Exploratoire Premier Soutien) of AMIES (the mathematical agency in interaction with companies and society).

Aude Sportisse [email protected]

1 Laboratoire de Probabilités Statistique et Modélisation, Sorbonne Université, Paris, France
2 Centre de Mathématiques Appliquées, Ecole Polytechnique, Palaiseau, France
3 Département de Mathématiques et applications, Ecole Normale Supérieure, Paris, France
4 XPOP, INRIA Saclay, Palaiseau, France
