Imputation and low-rank estimation with Missing Not At Random data
Aude Sportisse1,2 · Claire Boyer1,3 · Julie Josse2,4
Received: 27 December 2018 / Accepted: 7 July 2020
© Springer Science+Business Media, LLC, part of Springer Nature 2020
Abstract Missing values challenge data analysis because many supervised and unsupervised learning methods cannot be applied directly to incomplete data. Matrix completion methods based on low-rank assumptions are very powerful solutions for dealing with missing values. However, existing methods do not consider the case of informative missing values, which are widely encountered in practice. This paper proposes matrix completion methods to recover Missing Not At Random (MNAR) data. Our first contribution is to suggest a model-based estimation strategy by modelling the missing mechanism distribution. An EM algorithm is then implemented, involving a Fast Iterative Soft-Thresholding Algorithm (FISTA). Our second contribution is to suggest a computationally efficient surrogate estimation by implicitly taking into account the joint distribution of the data and the missing mechanism: the data matrix is concatenated with the mask coding for the missing values; a low-rank structure for exponential family is assumed on this new matrix, in order to encode links between variables and missing mechanisms. The methodology, which has the great advantage of handling different missing value mechanisms, is robust to model specification errors. The performance of our methods is assessed on real data collected from a trauma registry (TraumaBase®) containing clinical information about over twenty thousand severely traumatized patients in France. The aim is then to predict whether doctors should administer tranexamic acid, which would limit excessive bleeding, to patients with traumatic brain injury.
Keywords Informative missing values · Denoising · Matrix completion · Accelerated proximal gradient method · EM algorithm · Nuclear norm penalty
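As an illustration of the second contribution described in the abstract, the following minimal sketch builds a data matrix with informative missing values and concatenates it with its mask. It is not the authors' implementation: the logistic self-masking rule, the coding of the mask (1 = observed), the function names `self_masked_mnar` and `concatenate_with_mask`, and the matrix sizes are all assumptions made for the example, and the low-rank estimation step itself is not shown.

```python
import math
import random


def self_masked_mnar(values, slope=2.0, threshold=0.0, seed=0):
    """Return a 0/1 mask (1 = observed, an assumed convention).

    The probability that an entry is missing depends on the entry's own
    value through a logistic rule (an illustrative MNAR mechanism).
    """
    rng = random.Random(seed)
    mask = []
    for row in values:
        mask_row = []
        for y in row:
            p_missing = 1.0 / (1.0 + math.exp(-slope * (y - threshold)))
            mask_row.append(0 if rng.random() < p_missing else 1)
        mask.append(mask_row)
    return mask


def concatenate_with_mask(values, mask):
    """Build [Y_observed | mask]: missing entries become None and the
    mask columns are appended to the right of the data columns."""
    return [
        [y if m == 1 else None for y, m in zip(row, mask_row)] + list(mask_row)
        for row, mask_row in zip(values, mask)
    ]


# Hypothetical Gaussian data matrix (sizes chosen only for the example).
data_rng = random.Random(42)
n_rows, n_cols = 500, 4
Y = [[data_rng.gauss(0.0, 1.0) for _ in range(n_cols)] for _ in range(n_rows)]

mask = self_masked_mnar(Y)
augmented = concatenate_with_mask(Y, mask)

# Large values are more likely to be missing, so the mean of the observed
# entries underestimates the mean of the complete data.
full_mean = sum(y for row in Y for y in row) / (n_rows * n_cols)
observed = [y for row, m_row in zip(Y, mask) for y, m in zip(row, m_row) if m == 1]
observed_mean = sum(observed) / len(observed)
```

Comparing `observed_mean` with `full_mean` shows why ignoring an informative mechanism biases simple estimates. The matrix `augmented` has the data columns followed by the mask columns, which is the object on which a low-rank structure would then be assumed.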
Julie Josse was supported by the Data Analytics and Models for Insurance chair and Aude Sportisse was funded by a PEPS project (Projet Exploratoire Premier Soutien) of AMIES (the mathematical agency in interaction with companies and society).
Corresponding author: Aude Sportisse
1 Laboratoire de Probabilités Statistique et Modélisation, Sorbonne Université, Paris, France
2 Centre de Mathématiques Appliquées, Ecole Polytechnique, Palaiseau, France
3 Département de Mathématiques et applications, Ecole Normale Supérieure, Paris, France
4 XPOP, INRIA Saclay, Palaiseau, France
1 Introduction
The problem of missing data is ubiquitous in the practice of data analysis. The main approaches for handling missing data include imputation methods and the use of the Expectation-Maximization (EM) algorithm (Dempster et al. 1977), which yields maximum likelihood estimators in various incomplete-data problems (Little and Rubin 2014). The theoretical guarantees of these methods ensuring the correct prediction of missing values or the correct estimation of some parameters o