Variable selection for linear regression in large databases: exact methods
Joaquín Pacheco¹ · Silvia Casado¹

¹ Department of Applied Economics, University of Burgos, Burgos, Spain
Correspondence: Joaquín Pacheco, [email protected]; Silvia Casado, [email protected]

Accepted: 2 September 2020
© Springer Science+Business Media, LLC, part of Springer Nature 2020
Abstract
This paper analyzes the variable selection problem in the context of linear regression for large databases. The problem consists of selecting a small subset of independent variables that can perform the prediction task optimally. This problem has a wide range of applications. One important type of application is the design of composite indicators in various areas (sociology and economics, for example). Other important applications of variable selection in linear regression can be found in fields such as chemometrics, genetics, and climate prediction, among many others. For this problem, we propose a Branch & Bound method. This is an exact method and therefore guarantees optimal solutions. We also provide strategies that enable this method to be applied to very large databases (with hundreds of thousands of cases) in moderate computation time. A series of computational experiments shows that our method performs well in comparison with well-known methods from the literature and with commercial software.

Keywords: Variable selection · Linear regression · Branch & Bound methods · Heuristics
1 Introduction

1.1 Motivation

Research very often involves analyzing datasets with one dependent variable and multiple independent variables (a "response" variable and "predictor" variables), giving a dataset that is multivariate and multidimensional. Frequently, these analyses have been based on traditional models such as multiple linear regression. Other, more recent methods are based on neural networks, support vector machines, nearest neighbors, etc. In the simplest case, multiple linear regression involves a regression of the dependent variable with respect to the full set of predictor variables. Although this full-model regression approach might seem logical, it has several key problems. One of the most important is the following: having many predictors in a model adds noise to the analysis, with the effect that non-significant results may be returned even when
the model contains significant predictors [1]. Moreover, it is commonly assumed that only a small proportion of the predictor variables are truly influential on the response [2]. Recent improvements in data-collection technologies have resulted in complex regression problems in which the number of candidate predictor variables explaining the response variable may be very large. However, not all of the variables are equally relevant to this task, and many of them contribute redundant information. So, in regression, when the predictor vector contains many variables, variable selection becomes necessary in order to improve the precision of the model fit. The variable selection process attempts to identify the "best" subset of predictors.
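To make the selection task concrete, the sketch below enumerates every size-k subset of predictors and keeps the one with the smallest residual sum of squares from an ordinary least-squares fit. This is a minimal illustrative baseline in Python with NumPy, not the Branch & Bound method proposed in this paper, which prunes the same search space to remain tractable on large databases; the function name best_subset and the toy data are our own assumptions for illustration.

```python
import itertools
import numpy as np

def best_subset(X, y, k):
    """Exhaustively search all size-k subsets of the columns of X and
    return the subset minimizing the RSS of an OLS fit (with intercept)."""
    n, p = X.shape
    best_rss, best_cols = np.inf, None
    for cols in itertools.combinations(range(p), k):
        Xs = np.column_stack([np.ones(n), X[:, list(cols)]])  # add intercept
        beta, *_ = np.linalg.lstsq(Xs, y, rcond=None)
        rss = float(np.sum((y - Xs @ beta) ** 2))
        if rss < best_rss:
            best_rss, best_cols = rss, cols
    return best_cols, best_rss

# Toy usage: 50 cases, 8 candidate predictors, true model uses columns 0 and 3.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 8))
y = 2.0 * X[:, 0] - 1.5 * X[:, 3] + rng.normal(scale=0.1, size=50)
cols, rss = best_subset(X, y, k=2)
print(cols, rss)  # expected: (0, 3) with a small RSS
```

The enumeration above visits all C(p, k) subsets, which grows combinatorially in p; this is precisely why an exact method needs bounding rules to discard whole branches of the subset tree without fitting them.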