Semi-automated simultaneous predictor selection for regression-SARIMA models



Aaron P. Lowther¹ · Paul Fearnhead¹ · Matthew A. Nunes² · Kjeld Jensen³

Received: 17 October 2019 / Accepted: 27 August 2020 / © The Author(s) 2020

Abstract

Deciding which predictors to use plays an integral role in deriving statistical models in a wide range of applications. Motivated by the challenges of predicting events across a telecommunications network, we propose a semi-automated, joint model-fitting and predictor selection procedure for linear regression models. Our approach can model and account for serial correlation in the regression residuals, produces sparse and interpretable models and can be used to jointly select models for a group of related responses. This is achieved through fitting linear models under constraints on the number of nonzero coefficients using a generalisation of a recently developed mixed integer quadratic optimisation approach. The resultant models from our approach achieve better predictive performance on the motivating telecommunications data than methods currently used by industry.

Keywords Best subset selection · Linear regression · Mixed integer quadratic optimisation · Multivariate response model
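For orientation, the constraint on the number of nonzero coefficients mentioned in the abstract can be written, in its generic single-response form without serial correlation, as a best subset problem. The notation below is assumed for illustration and is not taken from this part of the article: y is the response vector of length n, X the n × p matrix of predictors, β the coefficient vector, k the maximum number of nonzero coefficients, z_j a binary inclusion indicator for predictor j, and M an assumed bound on the magnitude of each coefficient. This is the standard big-M mixed integer quadratic formulation of best subset selection, not the generalisation that the article develops.

```latex
% Best subset selection: least squares with at most k nonzero coefficients
\min_{\beta \in \mathbb{R}^{p}} \; \tfrac{1}{2} \lVert y - X\beta \rVert_2^2
\quad \text{subject to} \quad \lVert \beta \rVert_0 \le k .

% Equivalent mixed integer quadratic form (big-M; assumes |\beta_j| \le M at an optimum)
\min_{\beta \in \mathbb{R}^{p},\; z \in \{0,1\}^{p}} \; \tfrac{1}{2} \lVert y - X\beta \rVert_2^2
\quad \text{subject to} \quad
-M z_j \le \beta_j \le M z_j \;\; (j = 1, \dots, p), \qquad
\sum_{j=1}^{p} z_j \le k .
```

Setting z_j = 0 forces β_j = 0, so the constraint on the sum of the z_j caps the number of selected predictors at k.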

✉ Matthew A. Nunes [email protected]

1 Department of Mathematics and Statistics, Lancaster University, Lancaster LA1 4YF, UK
2 School of Mathematics, University of Bath, Bath BA2 7AY, UK
3 BT Applied Research, BT Plc, London EC1A 7AJ, UK

1 Introduction

The use of statistical models to drive business efficiency is becoming increasingly widespread (Proost and Fawcett 2013). Consequently, organisations are recording more and more data for subsequent analysis (see Katal et al. (2013) or Jordan and Mitchel (2015) for a review of current modelling challenges in this area). As a result, traditional (manual) approaches for building statistical models are often infeasible for the ever-increasing volumes of data. Automating these approaches is thus necessary and will allow principled statistical methods to remain at the forefront of business practice.

Our work is motivated by challenges faced by an industrial collaborator. In various parts of the business, diagnostic applications rely on the interpretability of models to guide investment or improvement programmes that correct for the impact of important predictors. In these applications, e.g. modelling building-level energy consumption, accurate demand predictions allow effective capacity planning and efficient maintenance scheduling.

In this article, we focus on one such application, representative of a typical industrial modelling challenge. The data we consider consist of daily events from multiple locations within a telecommunications network. Telecommunications events are often influenced by external predictors, for example, weather variables. The relationship between the predictors and the observed response variables is often complex and nonlinear, and the number of such predictors of events considered for a model in this setting can be in the