Introduction

Spatial Statistics

  • Taking into account spatial dependence possibly present in data is a foremost aspect of spatial statistics.

  • General spatial model (Cressie 1993): \(\{ Z(\mathbf{s}) \; : \; \mathbf{s} \in D \}\), where \(D\) is an index set.

  • Statistical inference depends heavily on the spatial structure/geometry of the observed spatial data.

Geometry Branch Index set
Points Geostatistics Continuum
Areas/polygons Areal models Countable
Mixed Spatial Data Fusion Continuum

Classic Models for Spatial Data

  • Areal data

  • Point-referenced data

    • Gaussian processes (GP)
  • Mixed/fused data

    • “Aggregated” GP (AGP) (Moraga et al. 2017): \[ Z(\mathbf{s}_i) = \begin{cases} {\mathcal{A}(\mathbf{s}_i)}^{-1} \int_{\mathcal{A}(\mathbf{s}_i)} Z(\mathbf{s}) \mathrm{d}\mathbf{s}, & \text{if } \mathcal{A}(\mathbf{s}_i) > 0, \\ Z(\mathbf{s}_i), & \text{otherwise,} \end{cases} \]

Research Questions

  • The main research questions we are interested in are:

    1. Can we propose a model for spatial data that accomodates areal, point-referrenced, and fused data?

    2. If so, is this model competitive when compared to specialized models?

  • Proposal: An isotropic GP defined on a flexible index set.

  • Main challenge: Defining a valid correlation function.

Hausdorff–Gaussian Process (HGP)

Distances between Sets

  • Metric space: \((D, d)\), where \(D\) is a spatial region of interest.

  • Distance between a point and a set: \(d(x, A) = \inf_{a \in A} d(x, a)\), where \(d(x, y)\) is the distance between any two elements \(x, y \in D\)

  • Directed Hausdorff distance: \[{\vec h}(A, B) = \sup_{a \in A} d(a, B)\]

  • Hausdorff distance: \[h(A, B) = \max \left \{ \vec{h}(A, B), \vec{h}(B, A) \right \}\]

Hausdorff Distance for Spatial Data Analysis

  • Study region: In spatial statistics, \(D\) is tipically a closed and bounded subset of \(\mathbb{R}^2\).

  • In this context, the Hausdorff distance is a metric (Sendov 2004).

The HGP

  • Index set: \(\mathcal{B}(D)\) represents the closed and bounded subsets of \(D \subset \mathbb{R}^2\).

  • \(Z(\mathbf{s}) \sim \mathrm{HGP}\{m(\mathbf{s}), v(\mathbf{s}), r(h)\}\), where \(\mathbf{s} \in \mathcal{B}(D)\).

  • Mean function: \(m(\mathbf{s}) = \mathbb{E}[Z(\mathbf{s})]\)

  • SD function: \(v(\mathbf{s}) = \sqrt{\mathrm{Var}(Z(\mathbf{s}))}\)

  • Correlation function: \(r(h) = \mathrm{Cor}(Z(\mathbf{s}_1), Z(\mathbf{s}_2)),\) where \(h\) denotes the Hausdorff distance between \(\mathbf{s}_1, \mathbf{s}_2 \in \mathcal{B}(D)\).

  • The induced covariance function is given by \(\mathrm{Cov}(Z(\mathbf{s}_1), Z(\mathbf{s}_2)) = v(\mathbf{s}_1) v(\mathbf{s}_2) r(h(\mathbf{s}_1, \mathbf{s}_2))\)

The Powered Exponential Correlation (PEC) Function

  • PEC function: \[r(h; \phi, \nu) = \exp\left \{ - \frac{h^{\nu}}{\phi^{\nu}}\right \},\] where \(\nu\) is a smoothness parameter and \(\phi\) governs the range of the spatial dependence.

  • Parametrization: We reparametrize this function with \(\rho = {\log(10)}^{1 / \nu} \phi\).

  • Interpretation: \(\rho\) is the distance at which the spatial correlation reduces to \(0.10\).

Theoretical Foundation: Positive Definiteness of the PEC

Theorem

Let \(h = h(\mathbf{s}, \mathbf{s}')\) be the Hausdorff distance between two spatial units, denoted \(\mathbf{s}, \mathbf{s}' \in \mathcal{B}(D)\), where \(D \subset \mathbb{R}^2\). The powered exponential correlation function \(\exp \{ - h^{\nu} / \phi^{\nu} \}\) is positive definite for \(\nu \in (1/2, 1)\).

  • The theorem above guarantees the validity of the HGP equipped with a PEC function with \(\nu \in (1/2, 1)\).

  • The proof is based on embedding the Hausdorff distance into a high-dimensional \(L_1\) normed Euclidean space, and using the fact that the exponential correlation function is positive definite on this space.

HGP Recap

  • Flexibility: A process that handles point-referrenced, areal, and mixed spatial data by construction.

  • Hausdorff distance: Enables HGP’s correlation function to account for the shape, size, and orientation of spatial objects.

  • Validity: Using a PEC function ensures the HGP is a valid process.

flowchart LR
    A(Flexible index set) ==> B(GP)
    B ==> E(((HGP)))
    C(Hausdorff distance) ==> D(PEC)
    D ==> E

Spatial modeling

Spatial GLMM

A generalized linear mixed effects model (GLMM) can be written as \[\begin{aligned} & Y(\mathbf{s}_i) \mid \mathbf{x}_i, Z(\mathbf{s}_i) \overset{{\rm ind}}{\sim} f(\cdot \mid \mu_i, \boldsymbol{\gamma}) \\ & g(\mu_i) = \mathbf{x}_i \boldsymbol{\beta} + Z(\mathbf{s}_i). \end{aligned}\]

  • Probability distribution: \(f(\cdot)\)

  • Link function: \(g(\cdot)\)

  • Conditional mean: \(\mu_i = \mathbb{E}[Y(\mathbf{s}_i) \mid \mathbf{x}_i, Z(\mathbf{s}_i)]\)

  • Model parameters: \(\boldsymbol{\theta} = {\{\boldsymbol{\beta}^\top, \boldsymbol{\sigma}^\top, \boldsymbol{\delta}^\top, \boldsymbol{\gamma}^\top \}}^\top\)

  • Joint density: \(p(\mathbf{y} \mid \mathbf{z}, \boldsymbol{\theta}) = \prod_{i = 1}^n f(y(\mathbf{s}_i) \mid \mu_i, \boldsymbol{\gamma})\)

Bayesian Estimation & Prediction

  • Priors: \[\begin{align*} & \boldsymbol{\beta} \sim \mathcal{N}(\mathbf{0}, 10 \mathbf{I}) & \mathbf{Z} \sim \mathrm{HGP}\{0, v(\cdot), r(\cdot) \} & \\ & \rho \sim \mathrm{Exp}(a_\rho) & \alpha \sim \mathcal{N}(\mathbf{0}, \mathbf{I}) \qquad \text{ or } \quad & \sigma \sim t_{+}(3) \end{align*}\]

  • Posterior: \(\pi(\boldsymbol{\theta} \mid \mathbf{y}, \mathbf{z}) \propto p(\mathbf{y} \mid \mathbf{z}, \boldsymbol{\theta}) p(\mathbf{z} \mid \boldsymbol{\theta}) \pi(\boldsymbol{\theta})\)

  • MCMC sampler: No-U-Turn (Homan and Gelman 2014).

  • Convergence assessment: traceplots and split-\({\hat{R}}\) (Vehtari et al. 2021).

  • Posterior predictive distributions: \(p(\mathbf{y}^{\ast} \mid \mathbf{y})\)

Bayesian Modeling Recap

  • HGP as a prior for the random effects disribution in a GLMM.

  • Bayesian inference through MCMC.

  • Uncertainty quantification of predictions through the posterior predictive distributions.

Areal Data Application

Respiratory Disease Hospitalization in Glasgow

  • Sample units: 134 intermediate zones (IZ), where the \(i\)-th IZ is denoted \(\mathbf{s}_i\).

  • Number of hospitalizations: \(\mathbf{Y} = {(Y(\mathbf{s}_1), \ldots, Y(\mathbf{s}_n))}^\top\).

  • Expected number of hospitalizations based on the national age- and sex-standardized rates: \(E_i\).

  • Percentage of people classified as income deprived: \(\mathbf{X}\).

\[\begin{aligned} & (Y(\mathbf{s}_i) \mid x_i, z(\mathbf{s}_i)) \sim \text{Poisson}(E_i \mu_i) \\ & \log(\mu_i) = \beta_0 + x_i \beta_1 + z(\mathbf{s}_i) \end{aligned}\]

Disease Mapping: Estimation & GOF

HGP DAGAR
\(\beta_0\) -0.21 (-0.268, -0.139) -0.26 (-0.450, -0.137)
\(\beta_1\) 0.33 (0.284, 0.368) 0.31 (0.258, 0.370)
\(\sigma\) 0.19 (0.155, 0.234) 0.30 (0.218, 0.484)
\(\rho\) 2.25 (0.159, 6.948)
\(\psi\) 0.43 (0.069, 0.827)
LOOIC 1081.0 1081.9
WAIC 1038.0 1032.4

Disease Mapping: Spatial Dependence

Disease Mapping: Adjusted SIR

Data Fusion Application

Air Pollution in Ventura and Los Angeles

  • Point-referrenced data from 19 measurement stations available daily from 1999 to date;

  • Satellite-derived estimates (2010–2012) at 184 areal units.

  • PM2.5: \(\mathbf{Y} = {(Y(\mathbf{s}_1), \ldots, Y(\mathbf{s}_n))}^\top\).

  • Model: \((Y(\mathbf{s}_i) \mid z(\mathbf{s}_i)) \sim \mathcal{N}(\beta_0 + z(\mathbf{s}_i), \tau^2)\)

 

AGP Approximation

Air Pollution: Estimation and Prediction Assessment

HGP \(\rm AGP_1\) \(\rm AGP_2\)
\(\beta\) 5.61 (4.69, 6.45) 6.22 (2.16, 10.10) 6.19 (5.88, 6.48)
\(\rho\) 13.83 (7.82, 23.61) 13.16 (5.14, 30.24) 0.63 (0.46, 0.83)
\(\tau\) 0.18 (0.07, 0.31) 1.39 (1.24, 1.56) 0.54 (0.39, 0.72)
\(\sigma\) 3.85 (2.92, 5.18) 1.70 (0.96, 2.91) 2.41 (2.024, 2.84)
\(\sigma_a\) 1.24 (1.04, 1.51)
RMSP 1.05 1.45 1.64
Width 3.57 2.53 9.67
CPP 95.5 78.6 95.5
IS 4.80 15.11 13.65

Air Pollution: Change-of-Support

Applications: key findings

  • The proposed method consistently demonstrated performance comparable to specialized models tailored for areal and fused data.

  • Unlike traditional areal models, the HGP’s marginal variances are independent of the number of neighbors.

  • The HGP model simplifies data fusion by bypassing the need to define arbitrary grids for numerical integral evaluation, eliminating this step each time the joint probability distribution of the data and parameters is calculated.

  • Across both applications, the HGP provides an interpretable spatial dependence parameter and a spatial correlation function.

Closing remarks

Highlights

The HGP has proven to be a powerful model that offers:

  • Versatility: accomodates diverse spatial data types.

  • Performance: competitive against models designed for specific spatial data types.

  • Reliable predictions: prediction intervals with near nominal frequentist coverage.

  • Our conclusions are further supported by a comprehensive simulation study, detailed in our available preprint.

Future work and Limitations

  • Extensions:
    • non-Euclidean spaces
    • Big data
  • Limitations:
    • Big “n” problem inherited from geostatistics
    • Unclear how to incorporate anisotropy
    • Difficulties in obtaining spectral densities for correlation functions

References

Besag, J. (1974), “Spatial interaction and the statistical analysis of lattice systems,” Journal of the Royal Statistical Society. Series B (Methodological), JSTOR, 192–236.
Besag, J., York, J., and Mollié, A. (1991), “Bayesian image restoration, with two applications in spatial statistics,” Annals of the Institute of Statistical Mathematics, 43, 1–20.
Cressie, N. (1993), Statistics for spatial data, Wiley series in probability and statistics, Wiley.
Datta, A., Banerjee, S., Hodges, J. S., and Gao, L. (2019), “Spatial disease mapping using directed acyclic graph auto-regressive (DAGAR) models,” Bayesian analysis, NIH Public Access, 14, 1221.
Diggle, P. J., Tawn, J. A., and Moyeed, R. A. (1998), “Model-based geostatistics,” Journal of the Royal Statistical Society Series C: Applied Statistics, Oxford University Press, 47, 299–350.
Homan, M. D., and Gelman, A. (2014), “The No-U-turn sampler: Adaptively setting path lengths in hamiltonian Monte Carlo,” Journal of Machine Learning Research, JMLR.org, 15, 1593–1623.
Li, D., Tang, W., and Banerjee, S. (2023), “Inference for Gaussian processes with Matérn covariogram on compact Riemannian manifolds,” Journal of Machine Learning Research, 24, 1–26.
Min, D., Zhilin, L., and Xiaoyong, C. (2007), “Extended Hausdorff distance for spatial objects in GIS,” International Journal of Geographical Information Science, Taylor & Francis, 21, 459–475.
Moraga, P., Cramb, S. M., Mengersen, K. L., and Pagano, M. (2017), “A geostatistical model for combined analysis of point-level and area-level data using INLA and SPDE,” Spatial Statistics, Elsevier, 21, 27–41.
Sendov, B. (2004), “Hausdorff distance and image processing,” Russian Mathematical Surveys, IOP Publishing, 59, 319. https://doi.org/10.1070/RM2004v059n02ABEH000721.
Vehtari, A., Gelman, A., Simpson, D., Carpenter, B., and Bürkner, P.-C. (2021), “Rank-normalization, folding, and localization: An improved \(\hat{R}\) for assessing convergence of MCMC (with discussion),” Bayesian Analysis, International Society for Bayesian Analysis, 16, 667–718.

Thank you!