Imputation of clinical covariates in time series


Dimitris Bertsimas¹ · Agni Orfanoudaki¹ · Colin Pawlowski¹

Received: 8 April 2019 / Revised: 24 August 2020 / Accepted: 10 October 2020

© The Author(s), under exclusive licence to Springer Science+Business Media LLC, part of Springer Nature 2020

Abstract

Missing data is a common problem in longitudinal datasets, which include multiple observations of the same individual at different points in time. We introduce a new approach, MedImpute, for imputing missing clinical covariates in multivariate panel data. This approach integrates patient-specific information into an optimization formulation that can be adapted to different imputation algorithms. We present the formulation for a K-nearest neighbors model and derive a corresponding scalable first-order method, med.knn. Our algorithm provides imputations for datasets with both continuous and categorical features and with observations occurring at arbitrary points in time. In computational experiments on three real-world clinical datasets, we test its performance on imputation and downstream predictive tasks, varying the percentage of missing data, the number of observations per patient, and the mechanism of missing data. The proposed method improves upon both the imputation accuracy and the downstream predictive performance of the best of the benchmark imputation methods considered. We show that this edge is consistently present in both longitudinal and electronic health records datasets, as well as in both binary classification and regression settings. In computational experiments on synthetic data, we test the scalability of the algorithm on large datasets, and we show that an efficient method for hyperparameter tuning scales to datasets with tens of thousands of observations and hundreds of covariates while maintaining high imputation accuracy.

Keywords Missing data imputation · Time series data · Electronic health records · Longitudinal studies · Framingham heart study · K-nearest neighbors
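To make the general idea concrete, the following is a minimal illustrative sketch, not the MedImpute formulation or the med.knn algorithm, neither of which is specified in this excerpt. It assumes continuous features only and blends an ordinary K-nearest-neighbors estimate with a time-weighted average of the same patient's other observations. The function name `time_aware_knn_impute` and the parameters `alpha` (blend weight) and `decay` (exponential time-decay rate) are hypothetical choices made for illustration.

```python
import numpy as np


def time_aware_knn_impute(X, patient_ids, times, k=3, alpha=0.5, decay=1.0):
    """Fill NaNs in X by blending a KNN estimate with a same-patient estimate.

    Illustrative sketch only (hypothetical names and parameters):
    - X: (n, p) array of continuous features with NaN for missing entries
    - patient_ids: length-n array identifying the patient of each row
    - times: length-n array of observation times
    - alpha: weight on the same-patient, time-weighted estimate
    - decay: rate of the exponential down-weighting of distant time points
    Assumes every column has at least one observed value.
    """
    X = np.asarray(X, dtype=float)
    patient_ids = np.asarray(patient_ids)
    times = np.asarray(times, dtype=float)
    missing = np.isnan(X)

    # Warm start: replace missing entries with column means.
    col_means = np.nanmean(X, axis=0)
    filled = np.where(missing, col_means, X)
    imputed = filled.copy()

    for i in range(X.shape[0]):
        if not missing[i].any():
            continue
        # Cross-patient information: k nearest rows in the warm-started data.
        dists = np.linalg.norm(filled - filled[i], axis=1)
        dists[i] = np.inf  # exclude the row itself
        neighbors = np.argsort(dists)[:k]
        knn_est = filled[neighbors].mean(axis=0)

        # Same-patient information: other rows belonging to this patient.
        same = patient_ids == patient_ids[i]
        same[i] = False
        for j in np.where(missing[i])[0]:
            obs = same & ~missing[:, j]
            if obs.any():
                # Observations closer in time receive larger weights.
                w = np.exp(-decay * np.abs(times[obs] - times[i]))
                patient_est = np.average(X[obs, j], weights=w)
                imputed[i, j] = alpha * patient_est + (1 - alpha) * knn_est[j]
            else:
                imputed[i, j] = knn_est[j]
    return imputed


# Toy example: patient 0 has three visits, the second one is missing feature 0.
X_demo = [[1.0, 10.0], [np.nan, 12.0], [3.0, 14.0], [5.0, 50.0]]
ids_demo = [0, 0, 0, 1]
times_demo = [0.0, 1.0, 2.0, 0.0]
result = time_aware_knn_impute(X_demo, ids_demo, times_demo, k=1, alpha=1.0)
```

With `alpha=1.0` the missing value is filled purely from the same patient's neighboring visits, and with `alpha=0.0` the sketch reduces to a single pass of plain K-nearest-neighbors imputation. The paper's method instead derives imputations from an optimization formulation and also handles categorical features, which this sketch does not attempt.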

Editor: Joao Gama.

¹ Operations Research Center, E40-111, Massachusetts Institute of Technology, Cambridge, MA 02139, USA


Machine Learning

1 Introduction

Machine learning applied to healthcare data can generate actionable insights ranging from predicting the onset of disease to streamlining hospital operations. Statistical models that leverage the variety and richness of clinical data are still relatively rare and offer an exciting avenue for further research (Callahan and Shah 2017). As an increasing amount of information becomes available, the medical field expects machine learning to become an indispensable tool for clinicians (Obermeyer and Emanuel 2016). This information will come from various clinical and epidemiological sources. Claims records, clinical trials, and data from longitudinal studies have been an invaluable resource for medical research over the past decades. In many of these dataset
