Feature screening based on distance correlation for ultrahigh-dimensional censored data with covariate measurement error
Li-Pang Chen1

Received: 22 May 2019 / Accepted: 3 October 2020
© Springer-Verlag GmbH Germany, part of Springer Nature 2020
Abstract
Feature screening is an important method for reducing dimension and capturing informative variables in ultrahigh-dimensional data analysis. Its key idea is to select informative variables using correlations between the response and the covariates. Many methods have been developed for feature screening. These methods, however, are challenged by complex features of the data collection process as well as the nature of the data themselves. In particular, an incompletely observed response caused by right-censoring and covariate measurement error often accompany survival analysis. Although many methods have been proposed for censored data, little work is available for settings where an incomplete response and measurement error occur simultaneously. In addition, conventional feature screening methods may fail to detect truly important covariates that are marginally independent of the response variable due to correlations among the covariates. In this paper, we explore this important problem and propose a model-free feature screening method in the presence of a censored response and error-prone covariates. We also develop an iterative procedure to improve the accuracy of selecting all important covariates. Numerical studies are reported to assess the performance of the proposed method. Finally, we apply the proposed method to a real dataset.

Keywords Buckley–James imputation · Marginal dependence · Mismeasurement · Model misspecification · Survival data · Ultrahigh-dimension
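As a concrete illustration of the screening idea described above (ranking covariates by a correlation with the response), the following is a minimal sketch of marginal screening based on the sample distance correlation of Székely, Rizzo, and Bakirov for univariate variables. The function names `distance_correlation` and `screen_top_d` are hypothetical, not from the paper, and this sketch ignores censoring and measurement error, which are the paper's actual focus.

```python
import numpy as np

def distance_correlation(x, y):
    """Sample (V-statistic) distance correlation between two 1-D arrays."""
    # pairwise absolute-distance matrices
    a = np.abs(x[:, None] - x[None, :])
    b = np.abs(y[:, None] - y[None, :])
    # double-centering: subtract row/column means, add back the grand mean
    A = a - a.mean(axis=0) - a.mean(axis=1)[:, None] + a.mean()
    B = b - b.mean(axis=0) - b.mean(axis=1)[:, None] + b.mean()
    dcov2 = (A * B).mean()          # squared distance covariance
    dvar_x = (A * A).mean()         # squared distance variances
    dvar_y = (B * B).mean()
    denom = np.sqrt(dvar_x * dvar_y)
    return np.sqrt(dcov2 / denom) if denom > 0 else 0.0

def screen_top_d(y, X, d):
    """Rank covariates by marginal distance correlation; keep the top d."""
    scores = np.array([distance_correlation(X[:, j], y)
                       for j in range(X.shape[1])])
    return np.argsort(scores)[::-1][:d]
```

Because distance correlation is zero if and only if the two variables are independent, this ranking can detect nonlinear dependence that Pearson-correlation screening misses, which is one motivation for model-free screening.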
Li-Pang Chen
[email protected]

1 Department of Statistical and Actuarial Sciences, University of Western Ontario, 1151 Richmond St, London, ON N6A 3K7, Canada
1 Introduction

Ultrahigh-dimensional data appear in various scientific research areas, including genetics, finance, survival analysis, and so on. In regression analysis, ultrahigh-dimensional data are difficult to analyze because they usually contain many variables that are only weakly correlated with the response. In addition, the sample covariance matrix of ultrahigh-dimensional variables is usually singular because the dimension of the variables far exceeds the sample size. As a result, it is necessary to select informative variables and reduce the dimension of the covariates before constructing regression models. Moreover, to cope with ultrahigh dimensionality, a sparsity assumption is imposed; in other words, only a small number of predictor variables are associated with the response. In the early development of variable selection, Akaike's Information Criterion (AIC) (Akaike 1973) and the Bayesian Information Criterion (BIC) (Schwarz 1978) were two well-known conventional selection criteria. Those two methods search over all possible combinations of covariates so that the optimal sol
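The exhaustive search underlying AIC can be sketched as follows. This is a minimal illustration assuming a Gaussian linear model fit by least squares; the helper names `aic_linear` and `best_subset_aic` are hypothetical, and the intercept is omitted for simplicity. The point of the sketch is the combinatorial cost: the search visits 2^p - 1 candidate models, which is exactly why such criteria are infeasible in the ultrahigh-dimensional setting and screening is needed first.

```python
import itertools
import numpy as np

def aic_linear(y, X):
    """AIC for a Gaussian linear model fit by least squares (no intercept)."""
    n, k = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ beta) ** 2)
    # up to an additive constant: n * log(sigma^2_hat) + 2 * (number of parameters)
    return n * np.log(rss / n) + 2 * k

def best_subset_aic(y, X):
    """Exhaustive best-subset search minimizing AIC over all 2^p - 1 submodels."""
    p = X.shape[1]
    best_score, best_subset = np.inf, ()
    for size in range(1, p + 1):
        for subset in itertools.combinations(range(p), size):
            score = aic_linear(y, X[:, subset])
            if score < best_score:
                best_score, best_subset = score, subset
    return best_score, best_subset
```

Even at p = 30 this loop would visit over a billion models, whereas feature screening reduces p to a manageable size with a single pass of marginal statistics.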