Multi-scale process modelling and distributed computation for spatial data



Andrew Zammit-Mangion¹ · Jonathan Rougier²

Received: 17 July 2019 / Accepted: 30 June 2020 © Springer Science+Business Media, LLC, part of Springer Nature 2020

Abstract

Recent years have seen a huge development in spatial modelling and prediction methodology, driven by the increased availability of remote-sensing data and the reduced cost of distributed-processing technology. It is well known that modelling and prediction using infinite-dimensional process models is not possible with large data sets, and that both approximate models and, often, approximate-inference methods, are needed. The problem of fitting simple global spatial models to large data sets has been solved through the likes of multi-resolution approximations and nearest-neighbour techniques. Here we tackle the next challenge, that of fitting complex, nonstationary, multi-scale models to large data sets. We propose doing this through the use of superpositions of spatial processes with increasing spatial scale and increasing degrees of nonstationarity. Computation is facilitated through the use of Gaussian Markov random fields and parallel Markov chain Monte Carlo based on graph colouring. The resulting model allows for both distributed computing and distributed data. Importantly, it provides opportunities for genuine model and data scalability and yet is still able to borrow strength across large spatial scales. We illustrate a two-scale version on a data set of sea-surface temperature containing on the order of one million observations, and compare our approach to state-of-the-art spatial modelling and prediction methods.

Keywords Graph colouring · Markov chain Monte Carlo · Parallel sampling · Spatial statistics
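As a rough illustration of what "parallel Markov chain Monte Carlo based on graph colouring" can look like for a Gaussian Markov random field, the following is a minimal sketch and not the authors' implementation. It assumes a zero-mean GMRF on a small four-neighbour lattice with a dense precision matrix; the function names (`lattice_precision`, `greedy_colouring`, `chromatic_gibbs`) and the parameter `kappa2` are hypothetical and chosen only for this example. Nodes that share a colour are not neighbours, so their full conditionals do not depend on one another and they can be updated in a single vectorised (or distributed) step.

```python
import numpy as np


def lattice_precision(n_rows, n_cols, kappa2=0.5):
    """Precision matrix Q = kappa2 * I + Laplacian of a 4-neighbour lattice.

    Illustrative assumption only: any sparse, positive-definite Q would do.
    """
    n = n_rows * n_cols
    Q = np.zeros((n, n))
    for r in range(n_rows):
        for c in range(n_cols):
            i = r * n_cols + c
            for dr, dc in ((0, 1), (1, 0)):  # right and down neighbours
                rr, cc = r + dr, c + dc
                if rr < n_rows and cc < n_cols:
                    j = rr * n_cols + cc
                    Q[i, j] = Q[j, i] = -1.0
                    Q[i, i] += 1.0
                    Q[j, j] += 1.0
    Q[np.diag_indices(n)] += kappa2
    return Q


def greedy_colouring(Q):
    """Give each node the smallest colour not used by its already-coloured neighbours.

    Two nodes are neighbours when the off-diagonal entry Q[i, j] is nonzero.
    """
    n = Q.shape[0]
    colours = -np.ones(n, dtype=int)
    for i in range(n):
        nbrs = np.flatnonzero(Q[i])
        used = {colours[j] for j in nbrs if j != i and colours[j] >= 0}
        c = 0
        while c in used:
            c += 1
        colours[i] = c
    return colours


def chromatic_gibbs(Q, colours, n_sweeps, rng):
    """Gibbs sampler for x ~ N(0, Q^{-1}) that updates one colour class at a time."""
    n = Q.shape[0]
    x = np.zeros(n)
    d = np.diag(Q)
    classes = [np.flatnonzero(colours == c) for c in range(colours.max() + 1)]
    samples = np.empty((n_sweeps, n))
    for t in range(n_sweeps):
        for idx in classes:
            # Full conditional of node i: N(-sum_{j != i} Q_ij x_j / Q_ii, 1 / Q_ii).
            # Within a colour class the Q_ij are zero, so all nodes update at once.
            off_diag = Q[idx] @ x - d[idx] * x[idx]
            mean = -off_diag / d[idx]
            x[idx] = mean + rng.standard_normal(idx.size) / np.sqrt(d[idx])
        samples[t] = x
    return samples


Q = lattice_precision(4, 5)
colours = greedy_colouring(Q)
samples = chromatic_gibbs(Q, colours, n_sweeps=200, rng=np.random.default_rng(0))
```

On a four-neighbour lattice the greedy pass recovers the familiar two-colour checkerboard, so each sweep consists of two block updates regardless of the lattice size. A dense matrix is used here only for brevity; a sparse representation would be the natural choice at scale.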

¹ School of Mathematics and Applied Statistics, University of Wollongong, Wollongong, Australia
² School of Mathematics, University of Bristol, Bristol, UK

1 Introduction

Large spatial/spatio-temporal data sets are now centre-stage in several of the environmental sciences such as meteorology and glaciology. Two popular tools available to the spatial statistician to deal with such data are the hierarchical model and the closely related notion of conditional independence (Cressie and Wikle 2011, Section 2.1.5). In a two-layer, linear, Gaussian, data-process model, the widely adopted assumption that data are conditionally independent, given a low-dimensional underlying process, is sufficient for developing inferential algorithms that scale linearly with the dimension of the data. Several methods capitalise on this approach for the spatial or spatio-temporal analysis of big data; these include fixed rank kriging (Cressie and Johannesson 2008), predictive processes (Banerjee et al. 2008), and a suite of approaches based on Gaussian Markov random field (GMRF) approximations to geostatistical models (e.g., Rue and Tjelmeland 2002; Lindgren et al. 2011; Nychka et al. 2015). For spati