Variable selection techniques after multiple imputation in high-dimensional data


Faisal Maqbool Zahid1 · Shahla Faisal1 · Christian Heumann2

Accepted: 12 October 2019 © Springer-Verlag GmbH Germany, part of Springer Nature 2019

Abstract High-dimensional data arise from diverse fields of scientific research. Missing values are often encountered in such data. Variable selection plays a key role in high-dimensional data analysis. Like many other statistical techniques, variable selection requires complete cases without any missing values. A variety of variable selection techniques is available for complete data, but comparable techniques for data with missing values are scarce in the literature. Multiple imputation is a popular approach to handling missing values and obtaining completed data. If a particular variable selection technique is applied independently to each of the multiply imputed datasets, a different model may result for each dataset. It is still unclear in the literature how to implement variable selection techniques on multiply imputed data. In this paper, we propose to use the magnitude of the parameter estimates of each candidate predictor across all the imputed datasets for its selection. A constraint is imposed on the sum of absolute values of these estimates to select or remove the predictor from the model. The proposed method for identifying the informative predictors is compared with other approaches in an extensive simulation study. The performance is compared on the basis of the hit rates (proportion of correctly identified informative predictors) and the false alarm rates (proportion of non-informative predictors wrongly identified as informative) for different numbers of imputed datasets. The proposed technique is simple and easy to implement, and performs equally well in the high-dimensional case as in low-dimensional settings. The proposed technique is observed to be a good competitor to the existing approaches in different simulation settings. The performance of different variable selection techniques is also examined for a real dataset with missing values.
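As a rough illustration of the quantities named in the abstract, the following minimal sketch computes the sum of absolute parameter estimates of each predictor across the imputed datasets, and the hit rate and false alarm rate of a resulting selection. It is a sketch under stated assumptions, not the method as specified in the paper: the function names and the cut-off `threshold` are hypothetical, and the exact form of the constraint is not given in this excerpt, so a simple comparison against a fixed value stands in for it.

```python
import numpy as np


def select_by_abs_sum(coefs, threshold):
    """Flag predictors whose summed absolute estimates exceed a cut-off.

    coefs: array-like of shape (M, p), one row of parameter estimates
           per imputed dataset (M datasets, p candidate predictors).
    threshold: hypothetical cut-off (an assumption of this sketch).
    Returns a boolean array of length p (True = predictor selected).
    """
    coefs = np.asarray(coefs, dtype=float)
    abs_sum = np.abs(coefs).sum(axis=0)  # sum over the M imputed datasets
    return abs_sum > threshold


def hit_and_false_alarm_rates(selected, informative):
    """Hit rate and false alarm rate as defined in the abstract.

    hit rate         = selected informative predictors / all informative
    false alarm rate = selected non-informative predictors / all non-informative
    """
    selected = np.asarray(selected, dtype=bool)
    informative = np.asarray(informative, dtype=bool)
    hit_rate = (selected & informative).sum() / informative.sum()
    false_alarm_rate = (selected & ~informative).sum() / (~informative).sum()
    return hit_rate, false_alarm_rate


# Toy example: M = 3 imputed datasets, p = 4 candidate predictors.
coefs = [
    [1.2, 0.0, -0.4, 0.0],
    [0.9, 0.1, -0.5, 0.0],
    [1.1, 0.0, -0.3, 0.0],
]
selected = select_by_abs_sum(coefs, threshold=0.5)
hit_rate, false_alarm_rate = hit_and_false_alarm_rates(
    selected, informative=[True, True, False, False]
)
```

In the toy example the first and third predictors are selected. With the first two predictors taken as truly informative, one of two informative predictors is found and one of two non-informative predictors is wrongly selected.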

1 Department of Statistics, Government College University Faisalabad, Faisalabad, Pakistan

2 Department of Statistics, Ludwig-Maximilians-University Munich, Munich, Germany

Keywords High-dimensional data · Multiple imputation · LASSO · Rubin’s rules · Variable selection

1 Introduction High-dimensional data arise from diverse fields of scientific research. The problems associated with high-dimensional data analysis (Hastie et al. 2009) are often challenging in fields such as genomics, medicine, health sciences, environmental sciences, economics, finance, social surveys and machine learning. Variable selection plays a critical role in high-dimensional data analysis. A significant advancement in the variable selection techniques has evolved in