Robust high-dimensional regression for data with anomalous responses
Mingyang Ren1,2 · Sanguo Zhang1,2 · Qingzhao Zhang3
Received: 10 March 2020 / Revised: 28 June 2020
© The Institute of Statistical Mathematics, Tokyo 2020
Abstract Accurate response variables are crucial for training regression models. In some situations, including the high-dimensional case, response observations tend to be inaccurate, and directly fitting a conventional model then yields biased estimators. To analyze high-dimensional data with anomalous responses, we adopt the γ-divergence for variable selection and estimation. The proposed method is robust to anomalous responses and does not require the proportion of abnormal data to be modeled. It is implemented by an efficient coordinate descent algorithm. In the setting where the dimensionality p can grow exponentially fast with the sample size n, we rigorously establish variable selection consistency and estimation bounds. Numerical simulations and an application to real data demonstrate the performance of the proposed method.
Keywords Anomalous responses · Robust · γ-divergence · High-dimensional data
Electronic supplementary material The online version of this article (https://doi.org/10.1007/s10463-020-00764-1) contains supplementary material, which is available to authorized users.
* Qingzhao Zhang [email protected]
1 School of Mathematical Sciences, University of Chinese Academy of Sciences, Beijing 100049, China
2 Key Laboratory of Big Data Mining and Knowledge Management, Chinese Academy of Sciences, Beijing 100049, China
3 Department of Statistics, School of Economics, The Wang Yanan Institute for Studies in Economics, MOE Key Lab of Economics and Fujian Key Lab of Statistics, Xiamen University, Xiamen 361005, China
1 Introduction In a regression model, the prediction rule is derived from a labeled dataset. Traditional regression models assume that the response variables are recorded correctly; however, accurate responses are expensive and difficult to obtain because of insufficient information, subjective judgment, measurement error and so on, and directly fitting conventional methods to inaccurate responses leads to biased estimators (Piepel 2005). Anomalous responses are encountered in fields such as the Internet, finance, image processing and biology. For instance, the real data studied in our paper contain mislabeled responses owing to measurement error in the expression of receptor genes (Lopes et al. 2018). Traditional regression models are not applicable to this kind of data. It is noteworthy that such “mislabeled data” in discrete variables are an important special case of anomalous responses, also called “label noise” (Rebbapragada and Brodley 2007; Frénay and Verleysen 2013) or “misclassification” (Copeland et al. 1977; Grace 2017) in classification problems and “count error” (Cameron and Trivedi 2013) in count dat