STATISTICAL INFERENCES AND PREDICTIONS FOR AREAL DATA AND SPATIAL DATA FUSION WITH HAUSDORFF-GAUSSIAN PROCESSES

Available at lcgodoy.me/slides/2025-ucsf/

Lucas da Cunha Godoy

ldcgodoy@ucsc.com

EEB Department, UCSC

Department of Epidemiology & Biostatistics – UCSF

2025-03-12

Introduction

Spatial Statistics

Taking into account spatial dependence possibly present in data is a foremost aspect of spatial statistics.
General spatial model (Cressie 1993): \(\{ Z(\mathbf{s}) \; : \; \mathbf{s} \in D \}\), where \(D\) is an index set.
Statistical inference depends heavily on the spatial structure/geometry of the observed spatial data.

Geometry	Branch	Index set
Areas/polygons	Areal models	Countable
Points	Geostatistics	Continuum
Mixed	Spatial Data Fusion

Classic Models for Spatial Data

Areal data
- CAR (Besag 1974), ICAR, BYM (Besag et al. 1991), DAGAR (Datta et al. 2019).
Point-referenced data
- Gaussian processes (GP), Nearest-neighbor mixture processes (Zheng et al. 2023)
Mixed/fused data
- “Aggregated” GP (AGP) (Moraga et al. 2017): \[ Z(\mathbf{s}_i) = \begin{cases} Z(\mathbf{s}_i), & \text{if } \mathcal{A}(\mathbf{s}_i) = 0 \\ {\mathcal{A}(\mathbf{s}_i)}^{-1} \int_{\mathcal{A}(\mathbf{s}_i)} Z(\mathbf{s}) \mathrm{d}\mathbf{s}, & \text{if } \mathcal{A}(\mathbf{s}_i) > 0 \end{cases} \]

Research Questions

The main research questions we are interested in are:
1. Can we propose a model for spatial data that accomodates areal, point-referrenced, and fused data?
2. If so, is this model competitive when compared to specialized models?
Proposal: An isotropic GP defined on a flexible index set.
Main challenge: Defining a valid correlation function.

Hausdorff–Gaussian Process (HGP)

Preliminaries

Point-referenced and areal spatial units are (closed and bounded) sets.
We need to generalize distance between points to distance between sets.
Ideally, this distance should:
1. Take into account the shape, size, and orientation of spatial sample units.
2. Be “spatially interpretable”.

Distances between sets

Metric space: \((D, d)\), where \(D\) is a spatial region of interest.
Distance between a point and a set: \(d(x, A) = \inf_{a \in A} d(x, a)\), where \(d(x, y)\) is the distance between any two elements \(x, y \in D\)
Directed Hausdorff distance: \[{\vec h}(A, B) = \sup_{a \in A} d(a, B)\]
Hausdorff distance: \[h(A, B) = \max \left \{ \vec{h}(A, B), \vec{h}(B, A) \right \}\]

Hausdorff Distance for Spatial Data Analysis

Study region: In spatial statistics, \(D\) is tipically a closed and bounded subset of \(\mathbb{R}^2\).
In this context, the Hausdorff distance is a metric (Appendix C, Molchanov 2005).

The HGP

Index set: \(\mathcal{B}(D)\) represents the closed and bounded subsets of \(D \subset \mathbb{R}^2\).
\(Z(\mathbf{s}) \sim \mathrm{HGP}\{m(\mathbf{s}), v(\mathbf{s}), r(h)\}\), where \(\mathbf{s} \in \mathcal{B}(D)\).
Mean function: \(m(\mathbf{s}) = \mathbb{E}[Z(\mathbf{s})]\)
Covariance function: \(\mathrm{Cov}(Z(\mathbf{s}_1), Z(\mathbf{s}_2)) = v(\mathbf{s}_1) v(\mathbf{s}_2) r(h(\mathbf{s}_1, \mathbf{s}_2))\)
SD function: \(v(\mathbf{s}) = \sqrt{\mathrm{Var}(Z(\mathbf{s}))}\)
Correlation function: \(r(h) = \mathrm{Cor}(Z(\mathbf{s}_1), Z(\mathbf{s}_2)),\) where \(h\) denotes the Hausdorff distance between \(\mathbf{s}_1, \mathbf{s}_2 \in \mathcal{B}(D)\).

Ideally, we want bounded, compact, and non-empty sets for the Hausdorff distance to be a metric..
In \(\mathbb{R}^2\), a compact subset is bounded.
empty is a subset of any set -> it is bounded bc a subset of a bounded set is bounded itself.
Mean and SD functions can be informed by covariates.
The ensure the validity of the process, the function \(r(\cdot)\) must be positive definite.
Complex functional parametric space.
Inference is simplified (or possible) by assuming parametric forms to those functions.
Mean and SD functions can be informed by covariates.
The ensure the validity of the process, the function \(r(\cdot)\) must be positive definite.
Complex functional parametric space.
Inference is simplified (or possible) by assuming parametric forms to those functions.
The SD function allows the HGP to accommodate both homoscedastic and heteroscedastic scenarios.

The \(v(\mathbf{s})\) function

We may defined: \[v(\mathbf{s}) = \exp \{ \alpha_0 + \alpha_1 w(\mathbf{s}) \},\] where \(w(\mathbf{s})\) is a covariate available for any \(\mathbf{s} \in \mathcal{B}(D)\).

Useful special cases:
- Homoscedastic: \(w(\mathbf{s}) = 0\) (consequence, \(\sigma = \exp \{ \alpha_0 \}\))
- Data Fusion: \(w(\mathbf{s}) = \mathbb{1}(\mathcal{A}(\mathbf{s}) > 0)\).
- Area dependent: \(w(\mathbf{s}) = \mathcal{A}(\mathbf{s})\)
Although flexible, one has to be careful when choosing this function to ensure the process validity (Palacios and Steel 2006).

Ensuring the process’ validity

For a valid process, its correlation function must satisfy the following properties:

Diminish with increasing distance: \(\lim_{h \to \infty}r(h) = 0\).
Bounded from above by 1: \(r(0) = 1\).
Positive definiteness (PD): yields positive definite correlation matrices for all its finite-dimensional marginal distributions.
Unfortunately, functions that are guaranteed to be positive definite on \((\mathbb{R}^2, \lVert \cdot \rVert_2)\) are not necessarily positive definite on other metric spaces (Li et al. 2023).

PD and Empirical Assessment

Definition: A real function \(k\) on a metric space \((D, d)\) is PD if it is continuous and satisfies:

\[ \sum_{i = 1}^{m} \sum_{j = 1}^{m} a_i a_j k(\mathbf{s}_i, \mathbf{s}_j) \geq 0 \]

for real numbers \(a_1, \ldots, a_m\) and \(\mathbf{s}_1, \ldots, \mathbf{s}_m \in D\).
Necessary condition: All the eigenvalues of the resulting correlation matrix must be non-negative.
Empirical assessment: We can check for PD by examining the smallest eigenvalue of the correlation matrix across the parameter space.

The Powered Exponential Correlation (PEC) Function

PEC function: \[r(h; \phi, \nu) = \exp\left \{ - \frac{h^{\nu}}{\phi^{\nu}}\right \},\] where \(\nu\) is a smoothness parameter and \(\phi\) governs the range of the spatial dependence.

Parametrization: We reparametrize this function with \(\rho = {\log(10)}^{1 / \nu} \phi\).
Interpretation: \(\rho\) is the distance at which the spatial correlation reduces to \(0.10\).

Visualizing the PEC function

Empirical Assessment of PEC’s Positive Definiteness

HGP Recap

Flexibility: A process that handles point-referrenced, areal, and mixed spatial data by construction.
Hausdorff distance: Enables HGP’s correlation function to account for the shape, size, and orientation of spatial objects.
PEC Function Validity: A valid HGP depends on selecting an appropriate PEC correlation function.
Empirical assessments: When analyzing a new dataset or trying a new correlation function, it is imperative to empirically assess the validity of the function.

Spatial modeling

Spatial GLMM

A generalized linear mixed effects model (GLMM) can be written as \[\begin{aligned} & Y(\mathbf{s}_i) \mid \mathbf{x}_i, Z(\mathbf{s}_i) \overset{{\rm ind}}{\sim} f(\cdot \mid \mu_i, \boldsymbol{\gamma}) \\ & g(\mu_i) = \mathbf{x}_i \boldsymbol{\beta} + Z(\mathbf{s}_i). \end{aligned}\]

Probability distribution: \(f(\cdot)\)
Conditional mean: \(\mu_i = \mathbb{E}[Y(\mathbf{s}_i) \mid \mathbf{x}_i, Z(\mathbf{s}_i)]\)
Link function: \(g(\cdot)\)
Model parameters: \(\boldsymbol{\theta} = {\{\boldsymbol{\beta}^\top, \boldsymbol{\sigma}^\top, \boldsymbol{\delta}^\top, \boldsymbol{\gamma}^\top \}}^\top\)
Joint density: \(p(\mathbf{y} \mid \mathbf{z}, \boldsymbol{\theta}) = \prod_{i = 1}^n f(y(\mathbf{s}_i) \mid \mu_i, \boldsymbol{\gamma})\)

Bayesian Inference & Model Assessment

Posterior: \(\pi(\boldsymbol{\theta} \mid \mathbf{y}, \mathbf{z}) \propto p(\mathbf{y} \mid \mathbf{z}, \boldsymbol{\theta}) p(\mathbf{z} \mid \boldsymbol{\theta}) \pi(\boldsymbol{\theta})\)
MCMC sampler: No-U-Turn (Homan and Gelman 2014).
Convergence assessment: traceplots and split-\({\hat{R}}\) (Vehtari et al. 2021).
Goodness-of-fit criteria: LOOIC (lower values indicate better fit)
Posterior predictive distributions: \(p(\mathbf{y}^{\ast} \mid \mathbf{y})\)
Predictions assessment: Interval Score (IS) and RMSP (lower values indicate better fit)

Point and interval estimates: median and percentiles (\(0.025\) and \(0.975\)) of the marginal MCMC samples.
Parameters initialized by random sampling from their priors.
Random effects initialized from a standard normal distribution.
\(a_\rho\) is chosen so that: \(\mathbb{P}(\rho > U) = p_\rho\). In particular, \(a_\rho = - \log(p_\rho) / \rho_0\).
\(\nu\) hard to be estimated.

Predictions:

use the properties of GP and multivariate Normal distribution to obtain the closed-form distribution of the vector of spatial random effects \(\mathbf{Z}^\ast\) (Diggle et al. 1998);
sample \(\mathbf{z}^\ast_{(b)}\) from the distribution derived in the previous step;
sample \(\mathbf{y}^{\ast}_{(b)}\) from \(p(\mathbf{y}^{\ast} \mid \boldsymbol{\theta}_{(b)}, \mathbf{z}^{\ast}_{(b)})\), where \(\boldsymbol{\theta}_{(b)}\) is the \(b\)-th MCMC sample of \(\boldsymbol{\theta}\).

Priors

Independent normal priors for the regression coefficients: \(\boldsymbol{\beta} \sim \mathcal{N}(\mathbf{0}, 10 \mathbf{I})\)
HGP prior for the latent random effects: \(\mathbf{Z} \sim \mathrm{HGP}\{0, v(\cdot), r(\cdot) \}\)
Exponential prior for the spatial dependence parameter: \(\rho \sim \mathrm{Exp}(a_\rho)\), where \(a_{\rho} = - \log(p_{\rho}) / \rho_0\).
- \(a_\rho\) is chosen such that \(\mathbb{P}(\rho \geq \rho_0) = p_\rho\).
Homoscedastic variance: \(v(\mathbf{s}) = \sqrt{\mathrm{Var}(Z(\mathbf{s}))} = \sigma \sim t_{+}(3)\)
Heteroscedastic HGP: \(\alpha \sim \mathcal{N}(\mathbf{0}, \mathbf{I})\), where \(v(\mathbf{s}) = \exp \{ \alpha_0 + \sum_i \alpha_i w_i(\mathbf{s}) \}\).

Bayesian Modeling Recap

HGP as a prior for the random effects disribution in a GLMM.
Bayesian inference through MCMC.
Uncertainty quantification of predictions through the posterior predictive distributions.

Areal Data Application

Respiratory Disease Hospitalization in Glasgow

\[\begin{aligned} & (Y(\mathbf{s}_i) \mid x_i, z(\mathbf{s}_i)) \sim \text{Poisson}(E_i \mu_i) \\ & \log(\mu_i) = \beta_0 + x_i \beta_1 + z(\mathbf{s}_i) \end{aligned}\]

Sample units: 134 intermediate zones (IZ), where the \(i\)-th IZ is denoted \(\mathbf{s}_i\).
Number of hospitalizations: \(\mathbf{Y} = {(Y(\mathbf{s}_1), \ldots, Y(\mathbf{s}_n))}^\top\).
Expected number of hospitalizations based on the national age- and sex-standardized rates: \(E_i\).
Percentage of people classified as income deprived: \(x_i\).

Disease Mapping: Estimation & GOF

	HGP	DAGAR
\(\beta_0\)	-0.21 (-0.268, -0.139)	-0.26 (-0.450, -0.137)
\(\beta_1\)	0.33 (0.284, 0.368)	0.31 (0.258, 0.370)
\(\sigma\)	0.19 (0.155, 0.234)	0.30 (0.218, 0.484)
\(\rho\)	2.25 (0.159, 6.948)
\(\psi\)		0.43 (0.069, 0.827)


LOOIC	1081.0	1081.9

Disease Mapping: Spatial Dependence

Disease Mapping: Adjusted SIR

Data Fusion Application

Air Pollution in Ventura and Los Angeles

\[(Y(\mathbf{s}_i) \mid z(\mathbf{s}_i)) \sim \mathcal{N}(\beta_0 + z(\mathbf{s}_i), \tau^2)\]

PM_2.5: \(\mathbf{Y} = {(Y(\mathbf{s}_1), \ldots, Y(\mathbf{s}_n))}^\top\).
Point-referrenced data from 19 measurement stations available daily from 1999 to date;
Satellite-derived estimates (2010–2012) at 184 areal units.

AGP Approximation

Air Pollution: Estimation and Prediction Assessment

	HGP	\(\rm AGP_1\)	\(\rm AGP_2\)
\(\beta\)	5.61 (4.69, 6.45)	6.22 (2.16, 10.10)	6.19 (5.88, 6.48)
\(\rho\)	13.83 (7.82, 23.61)	13.16 (5.14, 30.24)	0.63 (0.46, 0.83)
\(\tau\)	0.18 (0.07, 0.31)	1.39 (1.24, 1.56)	0.54 (0.39, 0.72)
\(\sigma\)	3.85 (2.92, 5.18)	1.70 (0.96, 2.91)	2.41 (2.024, 2.84)
\(\sigma_a\)	1.24 (1.04, 1.51)


RMSP	1.05	1.45	1.64
Width	3.57	2.53	9.67
CPP	95.5	78.6	95.5
IS	4.80	15.11	13.65

Air Pollution: Change-of-Support

Applications: key findings

The proposed method consistently demonstrated performance comparable to specialized models tailored for areal and fused data.
Unlike traditional areal models, the HGP’s marginal variances are independent of the number of neighbors.
The HGP model simplifies data fusion by bypassing the need to define arbitrary grids for numerical integral evaluation, eliminating this step each time the joint probability distribution of the data and parameters is calculated.
Across both applications, the HGP provides an interpretable spatial dependence parameter and a spatial correlation function.

Closing remarks

Highlights

The HGP has proven to be a powerful model that offers:

Versatility: accomodates diverse spatial data types.
Performance: competitive against models designed for specific spatial data types.
Reliable predictions: prediction intervals with near nominal frequentist coverage.
Our conclusions are further supported by a comprehensive simulation study, detailed in our available preprint.

Future work and Limitations

Extensions:
- non-Euclidean spaces
- Big data
Limitations:
- Big “n” problem inherited from geostatistics
- Unclear how to incorporate anisotropy
- Difficulties in obtaining spectral densities for correlation functions

References

Besag, J. (1974), “Spatial interaction and the statistical analysis of lattice systems,” Journal of the Royal Statistical Society. Series B (Methodological), JSTOR, 192–236.

Besag, J., York, J., and Mollié, A. (1991), “Bayesian image restoration, with two applications in spatial statistics,” Annals of the Institute of Statistical Mathematics, 43, 1–20.

Cressie, N. (1993), Statistics for spatial data, Wiley series in probability and statistics, Wiley.

Datta, A., Banerjee, S., Hodges, J. S., and Gao, L. (2019), “Spatial disease mapping using directed acyclic graph auto-regressive (DAGAR) models,” Bayesian analysis, NIH Public Access, 14, 1221.

Diggle, P. J., Tawn, J. A., and Moyeed, R. A. (1998), “Model-based geostatistics,” Journal of the Royal Statistical Society Series C: Applied Statistics, Oxford University Press, 47, 299–350.

Homan, M. D., and Gelman, A. (2014), “The No-U-turn sampler: Adaptively setting path lengths in hamiltonian Monte Carlo,” Journal of Machine Learning Research, JMLR.org, 15, 1593–1623.

Li, D., Tang, W., and Banerjee, S. (2023), “Inference for Gaussian processes with Matérn covariogram on compact Riemannian manifolds,” Journal of Machine Learning Research, 24, 1–26.

Min, D., Zhilin, L., and Xiaoyong, C. (2007), “Extended Hausdorff distance for spatial objects in GIS,” International Journal of Geographical Information Science, Taylor & Francis, 21, 459–475.

Molchanov, I. S. (2005), Theory of random sets, Probability and its applications, London, England: Springer. https://doi.org/10.1007/1-84628-150-4.

Moraga, P., Cramb, S. M., Mengersen, K. L., and Pagano, M. (2017), “A geostatistical model for combined analysis of point-level and area-level data using INLA and SPDE,” Spatial Statistics, Elsevier, 21, 27–41.

Palacios, M. B., and Steel, M. F. J. (2006), “Non-Gaussian Bayesian geostatistical modeling,” Journal of the American Statistical Association, Taylor & Francis, 101, 604–618.

Vehtari, A., Gelman, A., Simpson, D., Carpenter, B., and Bürkner, P.-C. (2021), “Rank-normalization, folding, and localization: An improved \(\hat{R}\) for assessing convergence of MCMC (with discussion),” Bayesian Analysis, International Society for Bayesian Analysis, 16, 667–718.

Zheng, X., Kottas, A., and Sansó, B. (2023), “Nearest-neighbor mixture models for non-gaussian spatial processes,” Bayesian Analysis, International Society for Bayesian Analysis, 18, 1191–1222.

STATISTICAL INFERENCES AND PREDICTIONS FOR AREAL DATA AND SPATIAL DATA FUSION WITH HAUSDORFF-GAUSSIAN PROCESSES

Introduction

Spatial Statistics

Classic Models for Spatial Data

Research Questions

Hausdorff–Gaussian Process (HGP)

Preliminaries

Distances between sets

Hausdorff Distance for Spatial Data Analysis

The HGP

The \(v(\mathbf{s})\) function

Ensuring the process’ validity

PD and Empirical Assessment

The Powered Exponential Correlation (PEC) Function

Visualizing the PEC function

Empirical Assessment of PEC’s Positive Definiteness

HGP Recap

Spatial modeling

Spatial GLMM

Bayesian Inference & Model Assessment

Priors

Bayesian Modeling Recap

Areal Data Application

Respiratory Disease Hospitalization in Glasgow

Respiratory Disease Hospitalization in Glasgow

Disease Mapping: Estimation & GOF

Disease Mapping: Spatial Dependence

Disease Mapping: Adjusted SIR

Data Fusion Application

Air Pollution in Ventura and Los Angeles

AGP Approximation

Air Pollution: Estimation and Prediction Assessment

Air Pollution: Change-of-Support

Applications: key findings

Closing remarks

Highlights

Future work and Limitations

References

Thank you!