HAUSDORFF–GAUSSIAN PROCESS WITH A SPATIOTEMPORAL APPLICATION

Available at lcgodoy.me/slides/2024-ness/

Lucas da Cunha Godoy

Department of Statistics, UConn

NESS 2024

2024-05-23

Introduction

“Everything is related to everything else, but near things are more related than distant things.” —Tobler’s first law of geography

Spatial Data

  • Taking into account spatial dependence possibly present in data is a foremost aspect of spatial statistics.

  • Statistical inference depends heavily on the spatial structure/geometry of the observed spatial data.

Geometry Branch
Point-referrenced Geostatistics
Point patterns
Areal units/polygons Lattice/Areal models
Mixed Spatial Data Fusion

General Spatial Model

The model below is flexible enough to analyze most spatial data structures (Cressie 1993)

\[\{ Z(\mathbf{s}) \; : \; \mathbf{s} \in D \},\]

where \(D\) is an index set.

  • Areal models: index set is countable.

  • Geostatistics: index set is a continuum.

  • Gaussian Process (GP): a stochastic process such that all its finite-dimensional marginal distributions are multivariate normal distributions.

Research Questions

  • The main research questions we are interested in are:

    1. Can we propose a model for spatial data that accomodates areal, point-referrenced, and fused data?

    2. If so, is this model competitive when compared to specialized models?

  • Proposal: A GP defined on a flexible index set.

  • Main challenge: Defining a valid correlation function.

Hausdorff–Gaussian Process (HGP)

Distances between Sets

  • Metric space: \((D, d)\), where \(D\) is a spatial region of interest.

  • Distance between a point and a set: \(d(x, A) = \inf_{a \in A} d(x, a)\), where \(d(x, y)\) is the distance between any two elements \(x, y \in D\).

  • Directed Hausdorff distance: \[{\vec h}(A, B) = \sup_{a \in A} d(a, B)\]

  • Hausdorff distance: \[h(A, B) = \max \left \{ \vec{h}(A, B), \vec{h}(B, A) \right \}\]

Hausdorff Distance for Spatial Data Analysis

  • Study region: In spatial statistics, \(D\) is tipically a closed and bounded subset of \(\mathbb{R}^2\).

  • In this context, the Hausdorff distance is a metric (Sendov 2004).

The HGP

  • Isotropic stochastic process: \(\{ Z(\mathbf{s}) \, : \, \mathbf{s} \in \mathcal{B}(D) \},\) where \(D \subset \mathbb{R}^2\).

  • Flexible index set: \(\mathcal{B}(D)\) contains all closed and bounded subsets of \(D\).

  • The HGP: The stochastic process \(Z(\cdot)\) is an HGP if:

    • It is a GP

    • Its correlation function is a function of the Hausdorff distance between sets (in our case, spatial units).

  • This process is suitable to analyze point-referrenced, areal, and mixed spatial data.

Ensuring Validity: Correlation Functions for the HGP

For a valid process, its correlation function must satisfy the following properties:

  • Diminish with increasing distance: \(\lim_{h \to \infty}r(h) = 0\).

  • Bounded from above by 1: \(r(0) = 1\).

  • Positive-definiteness: yields positive-definite correlation matrices for all its finite-dimensional marginal distributions.

  • Unfortunately, functions that are guaranteed to be positive definite on \((\lVert \cdot \rVert_2, \mathbb{R}^2)\) are not necessarily positive definite on other metric spaces (Li et al. 2023).

The Powered Exponential Correlation (PEC) Function

  • PEC function: \[r(h; \phi, \nu) = \exp\left \{ - \frac{h^{\nu}}{\phi^{\nu}}\right \},\] where \(\nu\) is a smoothness parameter and \(\phi\) governs the range of the spatial dependence.

  • Parametrization: We reparametrize this function with \(\rho = {\log(10)}^{1 / \nu} \phi\).

  • Interpretation: \(\rho\) is the distance at which the spatial correlation reduces to \(0.10\).

Visualizing the PEC function

HGP Recap

  • Flexibility: A process that handles point-referrenced, areal, and mixed spatial data by construction.

  • Hausdorff distance: Enables HGP’s correlation function to account for the shape, size, and orientation of spatial objects.

  • Validity: Using a PEC function ensures the HGP is a valid process.

flowchart LR
    A(Flexible index set) ==> B(GP)
    B ==> E(((HGP)))
    C(Hausdorff distance) ==> D(PEC)
    D ==> E

Spatial modeling

Notation

  • Set of spatial locations: \(\mathcal{S} = \{ \mathbf{s}_1, \ldots, \mathbf{s}_n \}\), where \(\mathbf{s}_i \in \mathcal{B}(D)\) and \(D \subset \mathbb{R}^2\).

  • Response variable: \(\mathbf{Y} = {(Y(\mathbf{s}_1), \ldots, Y(\mathbf{s}_n))}^\top\).

  • Covariates: \(\mathbf{X} = (\mathbf{X}_1, \ldots, \mathbf{X}_p)\), where \(\mathbf{X}_{j} = (x_{j1}, \ldots, x_{jn})^\top\) and \(\mathbf{x}_i = (x_{1i}, \ldots, x_{pi})\).

  • Spatial random effects: \(\mathbf{Z} = {(Z(\mathbf{s}_1), \ldots, Z(\mathbf{s}_n))}^\top\).

  • Density functions (for response and latent variables): \(p(\cdot \mid \cdot)\)

  • Prior and posterior distributions: \(\pi(\cdot)\) and \(\pi(\cdot \mid \cdot)\)

Spatial GLMM

A generalized linear mixed model (GLMM) can be written as \[\begin{aligned} & Y(\mathbf{s}_i) \mid \mathbf{x}_i, z(\mathbf{s}_i) \overset{{\rm ind}}{\sim} f(\cdot \mid \mu_i, \omega) \\ & g(\mu_i) = \mathbf{x}_i \boldsymbol{\beta} + z(\mathbf{s}_i). \end{aligned}\]

  • Probability distribution: \(f(\cdot)\)

  • Link function: \(g(\cdot)\)

  • Conditional mean: \(\mu_i = \mathbb{E}[Y(\mathbf{s}_i) \mid \mathbf{x}_i, z(\mathbf{s}_i)]\)

  • Model parameters: \(\theta = {\{\boldsymbol{\beta}^\top, \boldsymbol{\sigma}^\top, \delta^\top, \omega^\top \}}^\top\)

  • Joint density: \(p(\mathbf{y} \mid \mathbf{z}, \theta) = \prod_{i = 1}^n f(y(\mathbf{s}_i) \mid \mu_i, \omega)\)

Bayesian Estimation

  • Priors: \[\begin{align*} & \mathbf{Z} \sim \mathrm{HGP}\{0, v(\cdot), r(\cdot) \} \\ & \boldsymbol{\beta} \sim \mathcal{N}(\mathbf{0}, 10^2 \mathbf{I}) \quad \sigma \sim t_{+}(3) \\ & \rho \sim \mathrm{Exp}(a_\rho) \quad \nu \sim \mathrm{Beta}(a_\nu, b_\nu) \end{align*}\]

  • Posterior: \(\pi(\theta \mid \mathbf{y}, \mathbf{z}) \propto p(\mathbf{y} \mid \mathbf{z}, \theta) p(\mathbf{z} \mid \theta) \pi(\theta)\)

  • MCMC sampler: No-U-Turn (Homan and Gelman 2014).

  • Convergence assessment: traceplots and split-\({\hat{R}}\) (Vehtari et al. 2021).

Goodness-of-fit and Predictions

  • Goodness-of-fit criteria: DIC, WAIC, and LOOIC (lower values indicate better fit)

  • Unobserved outcome at \(m\) locations: \(\mathbf{Y}^\ast = (y(\mathbf{s}^\ast_1), \ldots, y(\mathbf{s}^\ast_m))^\top\).

  • Posterior predictive distributions: \(p(\mathbf{y}^{\ast} \mid \mathbf{y}) = \int p(\mathbf{y}^{\ast} \mid \mathbf{z}^{\ast}, \theta) p(\mathbf{z}^{\ast} \mid \mathbf{z}, \theta) \pi(\theta \mid \mathbf{y}) {\rm d} \theta\)

  • Monte Carlo samples from \(p(\mathbf{y}^{\ast} \mid \mathbf{y})\) are obtained using the draws from the posterior \(\pi(\theta \mid \mathbf{y})\)

  • Predictions assessment: Interval Score (IS) and RMSP (lower values indicate better fit)

Spatiotemporal application

Tuberculosis in Brazil

  • TB in Context: Tuberculosis (TB) remains a major global health threat, second in infectious disease mortality only to COVID-19.

  • Brazilian TB Trends: Brazil’s national TB incidence declined only slightly from 2006–2015. Rio Grande do Sul (RS) reported significantly higher incidence than the national average in 2021, with the eastern region even more affected.

  • Spatial Dependence: Studies demonstrate strong spatial dependence of TB infections (and co-infections) in Brazil, both nationwide and in specific cities. Time dependence has been largely overlooked in the literature.

  • Risk Factors: TB risk factors include densely populated areas, poverty, substance abuse, and incarceration.

Objectives

  • Reliable incidence estimates:
    • TB incidence across the study period, incorporating location-specific factors.
    • Smaller municipalities benefit from “borrowed strength” from neighbors (in the Hausdorff distance sense).
    • Results enable the calculation of SIRs to pinpoint high-risk areas.
  • Forecasts:
    • TB incidence rates one year ahead.
    • These predictions offer crucial insights for proactive public health planning.

Model

  • Sample units: 54 municipalities, across 11 years (2011 to 2021). We use 2022 to assess the quality of predictions.

  • Number of TB cases: \(Y_t(\mathbf{s}_i) = Y(\mathbf{s}_i, t)\) at location \(\mathbf{s}_i\) and time \(t\).

  • Population: \(P_t(\mathbf{s}_i)\).

  • Five covariates and two way interactions between presence of prison and continuous variables (except IDESE).

\[\begin{aligned} & (Y_t(\mathbf{s}_i) \mid \mathbf{X}_{t}(\mathbf{s}_i), z_t(\mathbf{s}_i)) \overset{{\rm ind}}{\sim} \text{Poisson}(P_t(\mathbf{s}_i) \mu_{it}) \\ & \log(\mu_{it}) = \beta_0 + \mathbf{X}^\top_{t}(\mathbf{s}_i) \beta + z_t(\mathbf{s}_i) \end{aligned}\]

Spatiotemporal Trend

Explanatory Variables

Random Effects

  • No random effects

  • Unstructured: Random intercept per municipality (iid Normal)

  • Time dependent: \(\mathrm{AR}(1)\)

  • Spatial: Homoscedastic HGP

  • Separable spatiotemporal: \(\mathrm{AR}(1)\) + Homoscedastic HGP, Multivariate \(\mathrm{AR}(1)\) with HGP errors

  • Separable spatiotemporal and unstructured: Multivariate \(\mathrm{AR}(1)\) with HGP errors + random intercept per municipality

Results: GOF and Predictive Performance

LOOIC WAIC DIC RMSP IS
BYM 3606.1 3449.9 3505.4 123.3 176.6
DAGAR 3520.9 3447.5 3443.6 22.4 88.8
HGP 3516.1 3447.7 3450.4 21.1 87.8

Results: Relative Risks

Parameter Estimate
\(\exp(\beta_1)\) 2.34 (1.70, 3.19)
\(\exp(\beta_2)\) 1.33 (1.15, 1.56)
\(\exp(\beta_3)\) 1.03 (0.99, 1.07)
\(\exp(\beta_4)\) 0.97 (0.93, 1.00)
\(\exp(\beta_5)\) 0.99 (0.92, 1.07)
\(\exp(\beta_2 + \beta_{21})\) 1.75 (1.18, 2.52)
\(\exp(\beta_3 + \beta_{31})\) 2.25 (1.63, 3.09)
\(\exp(\beta_4 + \beta_{41})\) 2.51 (1.83, 3.46)

Spatiotemporal Dependence

Small Municipalities

Forecast

Population at Risk in 2022

Highlights

  • Tailored an HGP extension for spatiotemporal disease mapping.

  • Competitive with specialized models

  • It helps to gain insights into spatiotemporal disease mapping through spatiotemporal correlation functions.

  • More reliable estimates of risk factors

  • Out-of-sample predictions to inform public policies

Closing remarks

Conclusion

The HGP offers a powerful and versatile model for spatial analysis. Advantages include:

  • Common framework for diverse spatial data types.

  • Competitive performance when compared to specialized models.

  • Directly interpretable spatial parameters.

  • Reliable and accurate predictions.

Future work and Limitations

  • Extensions:
    • non-Euclidean spaces
    • Big data
  • Limitations:
    • Difficulties in obtaining spectral densities for correlation functions
    • Big “n” problem inherited from geostatistics
    • Unclear how to incorporate anisotropy

References

Besag, J., York, J., and Mollié, A. (1991), “Bayesian image restoration, with two applications in spatial statistics,” Annals of the Institute of Statistical Mathematics, 43, 1–20.
Cressie, N. (1993), Statistics for spatial data, Wiley series in probability and statistics, Wiley.
Datta, A., Banerjee, S., Hodges, J. S., and Gao, L. (2019), “Spatial disease mapping using directed acyclic graph auto-regressive (DAGAR) models,” Bayesian analysis, NIH Public Access, 14, 1221.
Dean, C., Ugarte, M., and Militino, A. (2001), “Detecting interaction between random region and fixed age effects in disease mapping,” Biometrics, Wiley Online Library, 57, 197–202.
Diggle, P. J., Tawn, J. A., and Moyeed, R. A. (1998), “Model-based geostatistics,” Journal of the Royal Statistical Society Series C: Applied Statistics, Oxford University Press, 47, 299–350.
Gneiting, T., and Raftery, A. E. (2007), “Strictly proper scoring rules, prediction, and estimation,” Journal of the American statistical Association, Taylor & Francis, 102, 359–378.
Gonçalves, F. B., and Gamerman, D. (2018), “Exact Bayesian inference in spatiotemporal Cox processes driven by multivariate Gaussian processes,” Journal of the Royal Statistical Society: Series B (Statistical Methodology), Wiley Online Library, 80, 157–175.
Homan, M. D., and Gelman, A. (2014), “The No-U-turn sampler: Adaptively setting path lengths in hamiltonian Monte Carlo,” Journal of Machine Learning Research, JMLR.org, 15, 1593–1623.
Johnson, O., Diggle, P., and Giorgi, E. (2020), “Dealing with spatial misalignment to model the relationship between deprivation and life expectancy: A model-based geostatistical approach,” International Journal of Health Geographics, Springer, 19, 1–13.
Leroux, B. G., Lei, X., and Breslow, N. (2000), “Estimation of disease rates in small areas: A new mixed model for spatial dependence,” in Statistical models in epidemiology, the environment, and clinical trials, Springer, pp. 179–191.
Li, D., Tang, W., and Banerjee, S. (2023), “Inference for Gaussian processes with Matérn covariogram on compact Riemannian manifolds,” Journal of Machine Learning Research, 24, 1–26.
Lindgren, F., Rue, H., and Lindström, J. (2011), “An explicit link between Gaussian fields and Gaussian Markov random fields: The stochastic partial differential equation approach,” Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73, 423–498. https://doi.org/https://doi.org/10.1111/j.1467-9868.2011.00777.x.
Min, D., Zhilin, L., and Xiaoyong, C. (2007), “Extended Hausdorff distance for spatial objects in GIS,” International Journal of Geographical Information Science, Taylor & Francis, 21, 459–475.
Moraga, P., Cramb, S. M., Mengersen, K. L., and Pagano, M. (2017), “A geostatistical model for combined analysis of point-level and area-level data using INLA and SPDE,” Spatial Statistics, Elsevier, 21, 27–41.
Rue, H., Martino, S., and Chopin, N. (2009), “Approximate Bayesian inference for latent Gaussian models by using integrated nested Laplace approximations,” Journal of the Royal Statistical Society: Series B (Statistical Methodology), Wiley Online Library, 71, 319–392.
Sendov, B. (2004), “Hausdorff distance and image processing,” Russian Mathematical Surveys, IOP Publishing, 59, 319. https://doi.org/10.1070/RM2004v059n02ABEH000721.
Vehtari, A., Gelman, A., Simpson, D., Carpenter, B., and Bürkner, P.-C. (2021), “Rank-normalization, folding, and localization: An improved \(\hat{R}\) for assessing convergence of MCMC (with discussion),” Bayesian Analysis, International Society for Bayesian Analysis, 16, 667–718.
Zhang, H. (2004), “Inconsistent estimation and asymptotically equal interpolations in model-based geostatistics,” Journal of the American Statistical Association, Taylor & Francis, 99, 250–261.

Appendix

Areal Models: Limitations

  • Out-of-sample predictions: Non-trivial, it may be necessary to refit the model.

  • Interpretability: Relationship between the spatial dependence parameter and correlation between neighbors is counterintuitive.

  • Heteroscedasticity: Marginal SD depend on the number of neighbors.

  • Extensions for change of support and data fusion problems are cumbersome and there exists no consensus in the literature in how to approach such problems.

Data Fusion Models: Limitations

  • Computing: Although satisfactory approximations exist, it becomes computationally prohibitive for moderately large datasets.

  • Approximations: Inferences may drastically change depending on the precision of the approximations.

  • Biases: It is hard to quantify the impact of the discretizations on inferences (Gonçalves and Gamerman 2018).

Areal Data and the CAR Model

  • \(Z_k \mid Z_{-k} \sim \mathcal{N}\left (\frac{\psi} {m_k} \sum_{i \sim k} Z_i, \frac{\tau^2}{m_k} \right)\)

  • Special case: ICAR.

  • Extensions of the ICAR model, providing a measure of spatial dependence have been developed (Dean et al. 2001; Leroux et al. 2000).

  • The directed acyclic graph auto-regressive (DAGAR) model (Datta et al. 2019) proposes a new way to construct precision matrices using a DAG derived from the original adjacency matrix.

Fused Data

  • GP at the point-referrenced level and a stochastic integral at the areal level (AGP) (Moraga et al. 2017): \[Z(\mathbf{s}_i) = \begin{cases} {\mathcal{A}(\mathbf{s}_i)}^{-1} \int_{\mathbf{s}_i} Z(\mathbf{s}) \mathrm{d}\mathbf{s}, & \text{if } \mathcal{A}(\mathbf{s}_i) > 0, \\ Z(\mathbf{s}_i), & \; \text{otherwise,} \end{cases}\]

  • Covariance (Johnson et al. 2020): \[\mathrm{Cov}(Z(\mathbf{s}_i), Z(\mathbf{s}_j)) = {(\mathcal{A}(\mathbf{s}_i) \mathcal{A}(\mathbf{s}_j))}^{-1} \int_{\mathbf{s}_j} \int_{\mathbf{s}_i} C(\mathbf{x}, \mathbf{y}) \mathrm{d} \mathbf{x} \mathrm{d} \mathbf{y}\]

  • Arbitrary precision: The stochastic integrals are evaluated numerically and depend on a user-defined grid.

Theoretical Foundation: Positive Definiteness of the PEC

Proposition

Let \(h = h(\mathbf{s}, \mathbf{s}')\) be the Hausdorff distance between two spatial units, denoted \(\mathbf{s}, \mathbf{s}' \in \mathcal{B}(D)\), where \(D \subset \mathbb{R}^2\). The powered exponential correlation function \(\exp \{ - h^{\nu} / \phi^{\nu} \}\) is positive definite for \(\nu \in (1/2, 1)\).

  • The proposition above guarantees the validity of the HGP equipped with a PEC function with \(\nu \in (1/2, 1)\).

  • The proof is based on embedding the Hausdorff distance into a high-dimensional \(L_1\) normed Euclidean space, and using the fact that the exponential correlation function is positive definite on this space.

Practical Validation: Empirical Assessment of Positive Definiteness

Simulation Studies: Setup

  • Goal: Evaluate HGP’s ability to adapt to different scenarios and compare its performance to specialized models.

  • 200 simulated datasets for each combination of data generation process and scenario.

  • Estimation assessment: MAPE, RMSE, frequentist coverage of the credible interval (CP).

  • Predictions assessment: bias, RMSP, interval score (Gneiting and Raftery 2007, IS), frequentist coverate of the prediction intervals (CPP).

Simulation Studies: Scenarios

Simulation Study: Areal Data

  • Data model: Poisson GLMM, single covariate, log link function, and a spatial random effect.

  • We fit Poisson GLMMs with a homoscedastic HGP prior on the random effect.

  • Assessment:

    • Parameter estimation when generating data from HGP random effects.
    • How good are the estimates of the regression coefficient under random effect misspecification?
    • Goodness-of-fit comparison with DAGAR priors on the random effect.

Areal Random Effects

  • Data Generation:

    • Besag-York-Mollié (BYM, Besag et al. 1991): \(\mathbf{Z} \sim \mathcal{N}(\mathbf{0}, \sigma^{2}((1 - \zeta) \mathbf{I} + \zeta \mathbf{Q}^{-})\), where \(\mathbf{Q}\) is the precision matrix from the ICAR model and \(\zeta\) is a mixing parameter.
    • Aggregated GP (AGP): \(Z(\mathbf{s}_i) = {\mathcal{A}(\mathbf{s}_i)}^{-1} \int_{\mathbf{s}_i} K(\mathbf{s}) \mathrm{d} \mathbf{s}\), where \(\{ K(\mathbf{s}) \, : \, \mathbf{s} \in D \}\) is a (point-level) zero-mean, second order stationary and isotropic GP.
  • Competing Method:

    • DAGAR: \(\mathbf{Z} \sim \mathcal{N}(\mathbf{0}, \Lambda(\tau, \psi))\), where \(\Lambda\) is a precision matrix, \(\tau\) is the conditional precision, and \(\psi\) measures spatial dependence.

Areal Simulation: Estimation

  • Generating data from HGP:

    • CP of the hardest parameter to estimate (\(\rho\)) was \(92.5\%\) in the worst case scenario.
    • For all the other parameters, the coverage was extremely close to the nominal level.
  • Regression parameter under misspecification:

    • Largest (worst) MAPE was \(6.6\%\).
    • CP of the credible interval suggests reliable estimates.

Areal Simulation: GOF

HGP outperforms DAGAR when simulating data from a AGP random effect.

Simulation Study: Data Fusion

  • Data model: Gaussian GLMM, no covariates, identity link function, and a spatial random effect.

  • We fit the models with a heteroscedastic HGP as prior on the random effect.

  • Competing method: AGP1 and AGP2 priors on the random effect. The latter employes a finer grid/mesh resolution to approximate integrals numerically.

  • Parameters estimation when generating data from HGP random effects.

  • Model comparison: Predictions assessment.

Competing Method: Meshes

Competing Method: Further Details

  • The model employs a Matérn covariance function (at point level) with smoothness parameter fixed at 1. The other two parameters are the SD \(\sigma\) and the practical range \(\rho\). Both are analogous to the parameters with the same name in the HGP model.

  • Inferences are conducted using the INLA (Rue et al. 2009) and the stochastic partial differential equation (Lindgren et al. 2011, SPDE) method.

Data Fusion Simulation: Estimation

  • The HGP inherits same problems from geostatistical models.

  • There are no weakly consistent estimators for the variance and spatial dependence parameters (Zhang 2004)

  • Predictions are still optimal (Zhang 2004)

  • The small scale variation parameter had a low coverage in the credible intervals (\(79.5\%\) for scenario 2).

  • The MAPE for the \(\sigma\), \(\sigma_a\), and \(\rho\) parameters fluctuated around \(10\%\) in both scenarios.

Data Fusion Simulation: Prediction

  • Uncertainty of the predictions: The HGP handles uncertainty in its predictions better than the AGP1 model, even when the latter reflects the true data generation process.
  • HGP as an AGP Approximation: The HGP delivers satisfactory approximations to an AGP model.

  • HGP Advantages: The HGP offers potential computational advantages for large datasets while maintaining clear parameter interpretation.

Highlights

  • Flexibility & Interpretability: Models diverse spatial data types and provides meaningful spatial parameters, regardless of unit size or shape.

  • Predictions: The HGP delivers accurate predictions

  • Areal Data: The model delivers valuable insights on the spatial dependence and accounts for sample units’ size, shape, and orientation.

  • Data Fusion: The HGP simplifies the modeling process eliminating the need of a user-defined grid or mesh. Consequently, avoids computationally intensive numerical integrations with arbitrary precision, making inference seamless.