Most recent changepoint detection in censored panel data

Hajra Siddiqa¹ · Sajid Ali¹ · Ismail Shah¹

Received: 29 October 2019 / Accepted: 19 August 2020
© Springer-Verlag GmbH Germany, part of Springer Nature 2020

Abstract This study aims to detect the most recent changepoint in censored panel data while ignoring dependence within and between segments and taking serial autocorrelation into account. A comparison of different methods for detecting the most recent changepoint in censored data is presented. Censoring rates of 20%, 50%, and 90% are considered for right and left censoring, and (10%, 10%), (25%, 25%), and (40%, 50%) for interval censoring. The methods compared are the most recent changepoint (MRC) method, double cumulative sum binary segmentation, nonparametric changepoint detection (ECP), multiple changepoints in multivariate time series, analysis of each series in the panel independently, and analysis of aggregated data (AGG). The censoring rate is found to have a significant effect on the detection of changepoints in high-dimensional data, and the MRC method outperforms the competing methods considered in this study. In addition to investigating the impact of penalties, the performance of the MRC and AGG methods is compared using water quality data from the Niagara River. A data set on the survival time of stroke patients is also analyzed. An R package, “cpcens”, is available on the Comprehensive R Archive Network (CRAN) to replicate the results of this article.

Keywords Change point · Panel censored data · High dimensional data · CUSUM · Binary segmentation · Cost function

Electronic supplementary material The online version of this article (https://doi.org/10.1007/s00180-020-01028-5) contains supplementary material, which is available to authorized users.

¹ Department of Statistics, Quaid-i-Azam University, Islamabad 45320, Pakistan

1 Introduction

High-dimensional data are characterized by multiple dimensions, i.e., attributes of a particular data set. Put simply, data are called high dimensional when the number of features exceeds the number of observations. High-dimensional data are often observed in many applications, for instance banking and insurance, geographical data analysis, and medical and gene expression studies, and have thus become increasingly important. Because high-dimensional observations on the same subject are gathered and stored over time, the term high-dimensional panel data is commonly used in the literature (Wooldridge 2010; Bardwell 2018; Aston and Kirch 2014). An important issue with high-dimensional data is the partial collection of information due to time or cost restrictions, known as censoring. For example, suppose we are interested in the effect of a medicine on the occurrence of stroke in a clinical trial with a trial period of 5 years. Then, there is a po