Context-Aware Time Series Imputation for Multi-Analyte Clinical Data

Kejing Yin1 · Liaoliao Feng2 · William K. Cheung1

Received: 17 August 2019 / Revised: 19 March 2020 / Accepted: 7 May 2020 / © Springer Nature Switzerland AG 2020

Abstract

Clinical time series imputation is recognized as an essential task in clinical data analytics. Most existing models either rely on strong assumptions about the underlying data-generating process or preserve only local properties without effectively accounting for global dependencies. To advance the state of the art in clinical time series imputation, we participated in the 2019 ICHI Data Analytics Challenge on Missing Data Imputation (DACMI). In this paper, we present our proposed model, Context-Aware Time Series Imputation (CATSI), a novel framework based on a bidirectional LSTM in which patients’ health states are explicitly captured by learning a “global context vector” from the entire clinical time series. The imputations are then produced with reference to the global context vector. We also incorporate a cross-feature imputation component to exploit the complex correlations among features. Empirical evaluations demonstrate that CATSI obtains a normalized root mean square deviation (nRMSD) of 0.1998, a 10.6% improvement over state-of-the-art models. Further experiments on datasets with consecutive missing values also illustrate the effectiveness of incorporating the global context in generating accurate imputations.

Keywords Missing data imputation · Clinical time series · Electronic health records

1 Department of Computer Science, Hong Kong Baptist University, Hong Kong SAR, China

2 School of Computer Science & Technology, East China Normal University, Shanghai, China

Journal of Healthcare Informatics Research

1 Introduction

The rapid development and global adoption of electronic health records (EHR) over the past decade have given researchers valuable opportunities to perform secondary analysis of the EHR data accumulated over the years. In addition to structured data, such as diagnosis codes and medication prescriptions, EHR data also contain clinical time series that are crucial for characterizing patients’ health conditions. In particular, bedside monitors and irregularly requested laboratory tests are often used to measure patients’ health status during their hospital stays. Modeling clinical time series has therefore become a critical component of healthcare data analytics, and considerable effort has been made to use clinical time series for tasks such as mortality prediction. However, clinical time series are generally of low quality, primarily due to the complexity of clinical practice, which hinders the application of data-driven approaches [10, 14]. One major issue is missing data. Thus, it is often necessary to fill in missing values in an incomplete clinical time series, which is referred to as time series imput