Using MCMC for Logistic Regression Model Selection Involving Large Number of Candidate Models



Abstract. Logistic regression models are commonly used for studying binary or proportional response variables. An important problem is to screen a number p of potential explanatory variables in order to select a subset of them which are most related to a response variable. Several criteria such as AIC, BIC, and the stochastic complexity criterion are available for this variable selection procedure. However, simply applying these criteria for an exhaustive search of the best subset is computationally infeasible, even when p is moderately large (e.g. p = 20, which implies 2^20 candidate subsets available for selection). In this paper we propose an MCMC random search procedure incorporating the above criteria to overcome the computational difficulty. Using this procedure we only need to search a sample of the candidate subsets in order to find the best one. We have studied various properties of this procedure concerning the convergence of the Markov chain generated and the probability and the efficiency of selecting the optimal model. The performance of our procedure is also assessed by a simulation study.
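To make the scale of the problem concrete, the following is a minimal sketch, not the authors' procedure, of how a single candidate subset might be scored. It assumes the standard definitions AIC = -2 log L + 2k and BIC = -2 log L + k log n, where k counts the fitted coefficients including an intercept. The helper names `fit_logistic` and `criteria`, the Newton-Raphson fitting routine, and the simulated data are all illustrative assumptions and do not come from the paper. The stochastic complexity criterion is omitted because its formula is not given in this section.

```python
import numpy as np


def fit_logistic(X, y, n_iter=50, tol=1e-10):
    """Maximum-likelihood logistic fit by Newton-Raphson; returns (beta, loglik)."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        mu = 1.0 / (1.0 + np.exp(-(X @ beta)))   # fitted probabilities
        w = mu * (1.0 - mu)                       # Bernoulli variances
        grad = X.T @ (y - mu)                     # score vector
        hess = X.T @ (X * w[:, None])             # observed information
        step = np.linalg.solve(hess, grad)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    eta = X @ beta
    # Bernoulli log-likelihood: sum of y*eta - log(1 + exp(eta))
    loglik = np.sum(y * eta - np.logaddexp(0.0, eta))
    return beta, loglik


def criteria(X_full, y, subset):
    """AIC and BIC of the logistic model using the columns listed in `subset`."""
    n = len(y)
    X = np.column_stack([np.ones(n)] + [X_full[:, j] for j in subset])
    _, loglik = fit_logistic(X, y)
    k = X.shape[1]                                # intercept + selected variables
    return {"AIC": -2.0 * loglik + 2.0 * k,
            "BIC": -2.0 * loglik + k * np.log(n)}


# Simulated data (an assumption for illustration): only variables 0 and 3 matter.
rng = np.random.default_rng(0)
n, p = 500, 20
X_full = rng.normal(size=(n, p))
eta_true = 0.5 + 1.0 * X_full[:, 0] - 1.0 * X_full[:, 3]
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-eta_true))).astype(float)

n_subsets = 2 ** p            # 1048576 candidate subsets when p = 20
good = criteria(X_full, y, [0, 3])
bad = criteria(X_full, y, [1, 2])
print(n_subsets)
print(good)
print(bad)
```

Each call to `criteria` requires one iterative model fit, so an exhaustive search at p = 20 would need 1,048,576 such fits. That cost is what motivates searching only a sample of the subsets.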

* This research is supported by a grant from the Australian Research Council and a grant from the Natural Sciences and Engineering Research Council of Canada.

K.-T. Fang et al. (eds.), Monte Carlo and Quasi-Monte Carlo Methods 2000 © Springer-Verlag Berlin Heidelberg 2002

1 Introduction

Logistic regression models are probably the most important models for studying how binomial (e.g. binary and proportional) response variables are affected by various explanatory variables. Whether or not a specific set of explanatory variables has significant effects on a binomial response can be investigated by a conventional hypothesis testing procedure. However, if the task is to find an optimal subset of explanatory variables for predicting or estimating the binomial response, the more attractive method is to proceed with explanatory variable selection or, equivalently, model selection via the comparison of various candidate models in terms of a single data-oriented model utility measure. Various model utility measures have been derived and used for general parametric model comparison in the literature. These include AIC (Akaike 1973 and 1974), BIC (Schwarz 1978), the stochastic complexity criterion (SCC) (Rissanen 1996 and Qian and Künsch 1998) and many others. Different utility measures have been obtained and used because of different emphases in evaluating the goodness of a model. In general three different but related issues may be considered concerning the goodness of a model; namely, predictability, information distance from the true model, and the model posterior likelihood. In this paper we will not study how these different considerations lead to different model utility measures. Rather we will simply apply the AIC, BIC and SCC to the problem of the logistic regression model selection and use these criteria as our model utility measures. We will