An R Package for Generating Covariance Matrices for Maximum-Entropy Sampling from Precipitation Chemistry Data


Hessa Al-Thani¹ · Jon Lee¹

Received: 23 December 2019 / Accepted: 20 March 2020 / © Springer Nature Switzerland AG 2020

Abstract

We present an open-source R package (MESgenCov v 0.1.0) for temporally fitting multivariate precipitation chemistry data and extracting a covariance matrix for use in the MESP (maximum-entropy sampling problem). We provide multiple functionalities for modeling and model assessment. The package is tightly coupled with NADP/NTN (National Atmospheric Deposition Program/National Trends Network) data from their set of 379 monitoring sites, 1978–present. The user specifies the sites, chemicals, and time period desired, fits an appropriate user-specified univariate model for each site and chemical selected, and the package produces a covariance matrix for use by MESP algorithms.

Keywords Maximum-entropy sampling · Covariance matrix · Environmental monitoring · Environmetrics · NADP · NTN
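To make concrete what "a covariance matrix for use by MESP algorithms" means, the following is a minimal sketch, not part of MESgenCov and not one of the branch-and-bound algorithms from the literature. It assumes a small symmetric positive-definite covariance matrix C and enumerates every index subset S of size s, scoring each by the log-determinant of the principal submatrix C[S, S]. The function names `log_det_spd` and `brute_force_mesp` and the matrix `C_EXAMPLE` are hypothetical and chosen only for illustration. Python with only the standard library is used here, since the source gives no code of its own.

```python
from itertools import combinations
from math import log, sqrt


def log_det_spd(A):
    """Log-determinant of a symmetric positive-definite matrix via Cholesky."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            acc = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                if acc <= 0.0:
                    raise ValueError("matrix is not positive definite")
                L[i][j] = sqrt(acc)
            else:
                L[i][j] = acc / L[j][j]
    # det(A) is the squared product of the Cholesky diagonal.
    return 2.0 * sum(log(L[i][i]) for i in range(n))


def brute_force_mesp(C, s):
    """Return (best_subset, best_log_det) over all index subsets of size s.

    Exhaustive enumeration; only sensible for very small n.
    """
    n = len(C)
    best_value, best_subset = float("-inf"), None
    for S in combinations(range(n), s):
        sub = [[C[i][j] for j in S] for i in S]  # principal submatrix C[S, S]
        value = log_det_spd(sub)
        if value > best_value:
            best_value, best_subset = value, S
    return best_subset, best_value


# Hypothetical 3x3 covariance matrix: variables 0 and 1 are strongly
# correlated, so keeping both adds little information.
C_EXAMPLE = [
    [4.0, 1.9, 0.2],
    [1.9, 1.0, 0.1],
    [0.2, 0.1, 3.0],
]

if __name__ == "__main__":
    subset, value = brute_force_mesp(C_EXAMPLE, 2)
    print(subset, value)
```

In this toy instance the selected pair is (0, 2): the two highly correlated variables 0 and 1 are not chosen together, which is the behavior one expects from an entropy-based design. Enumeration grows combinatorially with n, so it serves only to illustrate the objective on tiny inputs.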

¹ Department of Industrial and Operations Engineering, University of Michigan, Ann Arbor, MI 48109, USA

SN Operations Research Forum (2020) 1:17

1 Introduction

The MESP (maximum-entropy sampling problem) (see [8, 16, 23, 24]) has been applied to many domains where the objective is to determine a “most informative” subset Y_S, of pre-specified size s = |S| > 0, from a Gaussian random vector Y_N, |N| = n > s. Information is typically measured by (differential) entropy. Generally, we assume that Y_N has a joint Gaussian distribution with mean vector μ and covariance matrix C. Up to constants, the entropy of Y_S is the log of the determinant of the principal submatrix C[S, S]. So, the MESP seeks to maximize the (log) determinant of C[S, S] over all S ⊆ N with |S| = s.

The MESP is NP-hard (see [14]), and there has been considerable work on algorithms aimed at exact solutions for problems of moderate size; see [1–5, 7, 12, 14, 15, 17]. All of this algorithmic work is based on a branch-and-bound framework introduced in [14], and the bulk of the contributions in these references is on different methods for upper bounding the optimal value.

This work has been developed and validated on only a very small number of data sets, even though multivariate data is widely available. The reason for this shortcoming is that, despite all of the raw multivariate data that is available, it is not a simple matter to turn this data into meaningful covariance matrices for Gaussian random variables. Our goal with the R package (MESgenCov v 0.1.0) that we have developed is to provide such a link, between readily available raw environmental-monitoring data and covariance matrices suitable for the MESP, in the context of environmental monitoring. Our work fits squarely into recent efforts to better exploit massive amounts of available data for mathematical-programming approaches to decision problems. Even if we have r