Robust high-dimensional regression for data with anomalous responses


Mingyang Ren1,2 · Sanguo Zhang1,2 · Qingzhao Zhang3

Received: 10 March 2020 / Revised: 28 June 2020
© The Institute of Statistical Mathematics, Tokyo 2020

Abstract

Accurate response variables are crucial for training regression models. In some situations, including the high-dimensional case, response observations tend to be inaccurate, and directly fitting a conventional model then yields biased estimators. To analyze data with anomalous responses in the high-dimensional case, we adopt the γ-divergence to perform variable selection and estimation. The proposed method is robust to anomalous responses, and the proportion of abnormal data does not need to be modeled. It is implemented by an efficient coordinate descent algorithm. In the setting where the dimensionality p can grow exponentially fast with the sample size n, we rigorously establish variable selection consistency and estimation bounds. Numerical simulations and an application to real data demonstrate the performance of the proposed method.

Keywords Anomalous responses · Robust · γ-divergence · High-dimensional data

Electronic supplementary material The online version of this article (https://doi.org/10.1007/s10463-020-00764-1) contains supplementary material, which is available to authorized users.

* Qingzhao Zhang [email protected]

1 School of Mathematical Sciences, University of Chinese Academy of Sciences, Beijing 100049, China
2 Key Laboratory of Big Data Mining and Knowledge Management, Chinese Academy of Sciences, Beijing 100049, China
3 Department of Statistics, School of Economics, The Wang Yanan Institute for Studies in Economics, MOE Key Lab of Economics and Fujian Key Lab of Statistics, Xiamen University, Xiamen 361005, China


1 Introduction

In a regression model, the prediction rule is derived from a labeled dataset. Traditional regression models assume that the response variables are correct; however, accurate responses are expensive and difficult to obtain because of insufficient information, subjective judgment, measurement error and so on, and directly fitting conventional methods to inaccurate responses leads to biased estimators (Piepel 2005). Anomalous responses are encountered in fields such as the Internet, finance, image processing and biology. For instance, the real data studied in our paper contain mislabeled responses owing to measurement error in the expression of receptor genes (Lopes et al. 2018). Traditional regression models are not applicable to this kind of data. It is noteworthy that such “mislabeled data” in discrete variables is an important special case of anomalous responses, which is also called “label noise” (Rebbapragada and Brodley 2007; Frénay and Verleysen 2013) or “misclassification” (Copeland et al. 1977; Grace 2017) in classification problems and “count error” (Cameron and Trivedi 2013) in count data.
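
The claim that anomalous responses bias a conventional fit, and that a γ-divergence criterion can resist them, can be illustrated with a small simulation. The sketch below is not the penalized coordinate descent algorithm proposed in the paper: it assumes a low-dimensional Gaussian linear model without any penalty and uses a simple reweighting iteration whose weights are proportional to the fitted density raised to the power γ. The function name `gamma_linear_fit`, the choice γ = 0.5, the coefficients and the contamination scheme (10% of responses shifted by 15) are all hypothetical choices made for illustration.

```python
import numpy as np

def gamma_linear_fit(X, y, gamma=0.5, n_iter=100, tol=1e-8):
    """Illustrative gamma-divergence fit of an unpenalized Gaussian linear model.

    Assumption: a plain reweighting (fixed-point) iteration, not the
    coordinate descent algorithm of the paper.
    """
    beta = np.linalg.lstsq(X, y, rcond=None)[0]  # start from least squares
    resid = y - X @ beta
    sigma2 = np.mean(resid ** 2)
    for _ in range(n_iter):
        # Weights proportional to the Gaussian density to the power gamma;
        # observations with large residuals receive weights close to zero.
        logw = -gamma * resid ** 2 / (2.0 * sigma2)
        w = np.exp(logw - logw.max())
        w /= w.sum()
        # Weighted least-squares update of the coefficients.
        Xw = X * w[:, None]
        beta_new = np.linalg.solve(X.T @ Xw, Xw.T @ y)
        resid = y - X @ beta_new
        # Variance update implied by the gamma-divergence objective.
        sigma2 = (1.0 + gamma) * np.sum(w * resid ** 2)
        converged = np.max(np.abs(beta_new - beta)) < tol
        beta = beta_new
        if converged:
            break
    return beta, sigma2

# Hypothetical data: intercept plus two covariates, unit-variance noise.
rng = np.random.default_rng(0)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
beta_true = np.array([1.0, 2.0, -1.0])
y = X @ beta_true + rng.normal(size=n)
y[:50] += 15.0  # assumed contamination: 10% anomalous responses

beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]
beta_gamma, sigma2_gamma = gamma_linear_fit(X, y)
print("true :", beta_true)
print("OLS  :", np.round(beta_ols, 3))
print("gamma:", np.round(beta_gamma, 3))
```

In this setup the least-squares intercept is pulled toward the shifted responses, while the reweighted fit largely ignores them without the contamination proportion ever being specified. A larger γ downweights outliers more aggressively at some cost in efficiency on clean data.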
