HAUSDORFF–GAUSSIAN PROCESS WITH A SPATIOTEMPORAL APPLICATION
Available at lcgodoy.me/slides/2024-ness/
2024-05-23
“Everything is related to everything else, but near things are more related than distant things.” —Tobler’s first law of geography
Taking into account spatial dependence possibly present in data is a foremost aspect of spatial statistics.
Statistical inference depends heavily on the spatial structure/geometry of the observed spatial data.
Geometry | Branch |
---|---|
Point-referrenced | Geostatistics |
Point patterns | |
Areal units/polygons | Lattice/Areal models |
Mixed | Spatial Data Fusion |
The model below is flexible enough to analyze most spatial data structures (Cressie 1993)
\[\{ Z(\mathbf{s}) \; : \; \mathbf{s} \in D \},\]
where \(D\) is an index set.
Areal models: index set is countable.
Geostatistics: index set is a continuum.
Gaussian Process (GP): a stochastic process such that all its finite-dimensional marginal distributions are multivariate normal distributions.
The main research questions we are interested in are:
Can we propose a model for spatial data that accomodates areal, point-referrenced, and fused data?
If so, is this model competitive when compared to specialized models?
Proposal: A GP defined on a flexible index set.
Main challenge: Defining a valid correlation function.
Metric space: \((D, d)\), where \(D\) is a spatial region of interest.
Distance between a point and a set: \(d(x, A) = \inf_{a \in A} d(x, a)\), where \(d(x, y)\) is the distance between any two elements \(x, y \in D\).
Directed Hausdorff distance: \[{\vec h}(A, B) = \sup_{a \in A} d(a, B)\]
Hausdorff distance: \[h(A, B) = \max \left \{ \vec{h}(A, B), \vec{h}(B, A) \right \}\]
Study region: In spatial statistics, \(D\) is tipically a closed and bounded subset of \(\mathbb{R}^2\).
In this context, the Hausdorff distance is a metric (Sendov 2004).
Isotropic stochastic process: \(\{ Z(\mathbf{s}) \, : \, \mathbf{s} \in \mathcal{B}(D) \},\) where \(D \subset \mathbb{R}^2\).
Flexible index set: \(\mathcal{B}(D)\) contains all closed and bounded subsets of \(D\).
The HGP: The stochastic process \(Z(\cdot)\) is an HGP if:
It is a GP
Its correlation function is a function of the Hausdorff distance between sets (in our case, spatial units).
This process is suitable to analyze point-referrenced, areal, and mixed spatial data.
For a valid process, its correlation function must satisfy the following properties:
Diminish with increasing distance: \(\lim_{h \to \infty}r(h) = 0\).
Bounded from above by 1: \(r(0) = 1\).
Positive-definiteness: yields positive-definite correlation matrices for all its finite-dimensional marginal distributions.
Unfortunately, functions that are guaranteed to be positive definite on \((\lVert \cdot \rVert_2, \mathbb{R}^2)\) are not necessarily positive definite on other metric spaces (Li et al. 2023).
PEC function: \[r(h; \phi, \nu) = \exp\left \{ - \frac{h^{\nu}}{\phi^{\nu}}\right \},\] where \(\nu\) is a smoothness parameter and \(\phi\) governs the range of the spatial dependence.
Parametrization: We reparametrize this function with \(\rho = {\log(10)}^{1 / \nu} \phi\).
Interpretation: \(\rho\) is the distance at which the spatial correlation reduces to \(0.10\).
Flexibility: A process that handles point-referrenced, areal, and mixed spatial data by construction.
Hausdorff distance: Enables HGP’s correlation function to account for the shape, size, and orientation of spatial objects.
Validity: Using a PEC function ensures the HGP is a valid process.
Set of spatial locations: \(\mathcal{S} = \{ \mathbf{s}_1, \ldots, \mathbf{s}_n \}\), where \(\mathbf{s}_i \in \mathcal{B}(D)\) and \(D \subset \mathbb{R}^2\).
Response variable: \(\mathbf{Y} = {(Y(\mathbf{s}_1), \ldots, Y(\mathbf{s}_n))}^\top\).
Covariates: \(\mathbf{X} = (\mathbf{X}_1, \ldots, \mathbf{X}_p)\), where \(\mathbf{X}_{j} = (x_{j1}, \ldots, x_{jn})^\top\) and \(\mathbf{x}_i = (x_{1i}, \ldots, x_{pi})\).
Spatial random effects: \(\mathbf{Z} = {(Z(\mathbf{s}_1), \ldots, Z(\mathbf{s}_n))}^\top\).
Density functions (for response and latent variables): \(p(\cdot \mid \cdot)\)
Prior and posterior distributions: \(\pi(\cdot)\) and \(\pi(\cdot \mid \cdot)\)
A generalized linear mixed model (GLMM) can be written as \[\begin{aligned} & Y(\mathbf{s}_i) \mid \mathbf{x}_i, z(\mathbf{s}_i) \overset{{\rm ind}}{\sim} f(\cdot \mid \mu_i, \omega) \\ & g(\mu_i) = \mathbf{x}_i \boldsymbol{\beta} + z(\mathbf{s}_i). \end{aligned}\]
Probability distribution: \(f(\cdot)\)
Link function: \(g(\cdot)\)
Conditional mean: \(\mu_i = \mathbb{E}[Y(\mathbf{s}_i) \mid \mathbf{x}_i, z(\mathbf{s}_i)]\)
Model parameters: \(\theta = {\{\boldsymbol{\beta}^\top, \boldsymbol{\sigma}^\top, \delta^\top, \omega^\top \}}^\top\)
Joint density: \(p(\mathbf{y} \mid \mathbf{z}, \theta) = \prod_{i = 1}^n f(y(\mathbf{s}_i) \mid \mu_i, \omega)\)
Priors: \[\begin{align*} & \mathbf{Z} \sim \mathrm{HGP}\{0, v(\cdot), r(\cdot) \} \\ & \boldsymbol{\beta} \sim \mathcal{N}(\mathbf{0}, 10^2 \mathbf{I}) \quad \sigma \sim t_{+}(3) \\ & \rho \sim \mathrm{Exp}(a_\rho) \quad \nu \sim \mathrm{Beta}(a_\nu, b_\nu) \end{align*}\]
Posterior: \(\pi(\theta \mid \mathbf{y}, \mathbf{z}) \propto p(\mathbf{y} \mid \mathbf{z}, \theta) p(\mathbf{z} \mid \theta) \pi(\theta)\)
MCMC sampler: No-U-Turn (Homan and Gelman 2014).
Convergence assessment: traceplots and split-\({\hat{R}}\) (Vehtari et al. 2021).
Goodness-of-fit criteria: DIC, WAIC, and LOOIC (lower values indicate better fit)
Unobserved outcome at \(m\) locations: \(\mathbf{Y}^\ast = (y(\mathbf{s}^\ast_1), \ldots, y(\mathbf{s}^\ast_m))^\top\).
Posterior predictive distributions: \(p(\mathbf{y}^{\ast} \mid \mathbf{y}) = \int p(\mathbf{y}^{\ast} \mid \mathbf{z}^{\ast}, \theta) p(\mathbf{z}^{\ast} \mid \mathbf{z}, \theta) \pi(\theta \mid \mathbf{y}) {\rm d} \theta\)
Monte Carlo samples from \(p(\mathbf{y}^{\ast} \mid \mathbf{y})\) are obtained using the draws from the posterior \(\pi(\theta \mid \mathbf{y})\)
Predictions assessment: Interval Score (IS) and RMSP (lower values indicate better fit)
TB in Context: Tuberculosis (TB) remains a major global health threat, second in infectious disease mortality only to COVID-19.
Brazilian TB Trends: Brazil’s national TB incidence declined only slightly from 2006–2015. Rio Grande do Sul (RS) reported significantly higher incidence than the national average in 2021, with the eastern region even more affected.
Spatial Dependence: Studies demonstrate strong spatial dependence of TB infections (and co-infections) in Brazil, both nationwide and in specific cities. Time dependence has been largely overlooked in the literature.
Risk Factors: TB risk factors include densely populated areas, poverty, substance abuse, and incarceration.
Sample units: 54 municipalities, across 11 years (2011 to 2021). We use 2022 to assess the quality of predictions.
Number of TB cases: \(Y_t(\mathbf{s}_i) = Y(\mathbf{s}_i, t)\) at location \(\mathbf{s}_i\) and time \(t\).
Population: \(P_t(\mathbf{s}_i)\).
Five covariates and two way interactions between presence of prison and continuous variables (except IDESE).
\[\begin{aligned} & (Y_t(\mathbf{s}_i) \mid \mathbf{X}_{t}(\mathbf{s}_i), z_t(\mathbf{s}_i)) \overset{{\rm ind}}{\sim} \text{Poisson}(P_t(\mathbf{s}_i) \mu_{it}) \\ & \log(\mu_{it}) = \beta_0 + \mathbf{X}^\top_{t}(\mathbf{s}_i) \beta + z_t(\mathbf{s}_i) \end{aligned}\]
No random effects
Unstructured: Random intercept per municipality (iid Normal)
Time dependent: \(\mathrm{AR}(1)\)
Spatial: Homoscedastic HGP
Separable spatiotemporal: \(\mathrm{AR}(1)\) + Homoscedastic HGP, Multivariate \(\mathrm{AR}(1)\) with HGP errors
Separable spatiotemporal and unstructured: Multivariate \(\mathrm{AR}(1)\) with HGP errors + random intercept per municipality
LOOIC | WAIC | DIC | RMSP | IS | |
---|---|---|---|---|---|
BYM | 3606.1 | 3449.9 | 3505.4 | 123.3 | 176.6 |
DAGAR | 3520.9 | 3447.5 | 3443.6 | 22.4 | 88.8 |
HGP | 3516.1 | 3447.7 | 3450.4 | 21.1 | 87.8 |
Parameter | Estimate |
---|---|
\(\exp(\beta_1)\) | 2.34 (1.70, 3.19) |
\(\exp(\beta_2)\) | 1.33 (1.15, 1.56) |
\(\exp(\beta_3)\) | 1.03 (0.99, 1.07) |
\(\exp(\beta_4)\) | 0.97 (0.93, 1.00) |
\(\exp(\beta_5)\) | 0.99 (0.92, 1.07) |
\(\exp(\beta_2 + \beta_{21})\) | 1.75 (1.18, 2.52) |
\(\exp(\beta_3 + \beta_{31})\) | 2.25 (1.63, 3.09) |
\(\exp(\beta_4 + \beta_{41})\) | 2.51 (1.83, 3.46) |
Tailored an HGP extension for spatiotemporal disease mapping.
Competitive with specialized models
It helps to gain insights into spatiotemporal disease mapping through spatiotemporal correlation functions.
More reliable estimates of risk factors
Out-of-sample predictions to inform public policies
The HGP offers a powerful and versatile model for spatial analysis. Advantages include:
Common framework for diverse spatial data types.
Competitive performance when compared to specialized models.
Directly interpretable spatial parameters.
Reliable and accurate predictions.
Out-of-sample predictions: Non-trivial, it may be necessary to refit the model.
Interpretability: Relationship between the spatial dependence parameter and correlation between neighbors is counterintuitive.
Heteroscedasticity: Marginal SD depend on the number of neighbors.
Extensions for change of support and data fusion problems are cumbersome and there exists no consensus in the literature in how to approach such problems.
Computing: Although satisfactory approximations exist, it becomes computationally prohibitive for moderately large datasets.
Approximations: Inferences may drastically change depending on the precision of the approximations.
Biases: It is hard to quantify the impact of the discretizations on inferences (Gonçalves and Gamerman 2018).
\(Z_k \mid Z_{-k} \sim \mathcal{N}\left (\frac{\psi} {m_k} \sum_{i \sim k} Z_i, \frac{\tau^2}{m_k} \right)\)
Special case: ICAR.
Extensions of the ICAR model, providing a measure of spatial dependence have been developed (Dean et al. 2001; Leroux et al. 2000).
The directed acyclic graph auto-regressive (DAGAR) model (Datta et al. 2019) proposes a new way to construct precision matrices using a DAG derived from the original adjacency matrix.
GP at the point-referrenced level and a stochastic integral at the areal level (AGP) (Moraga et al. 2017): \[Z(\mathbf{s}_i) = \begin{cases} {\mathcal{A}(\mathbf{s}_i)}^{-1} \int_{\mathbf{s}_i} Z(\mathbf{s}) \mathrm{d}\mathbf{s}, & \text{if } \mathcal{A}(\mathbf{s}_i) > 0, \\ Z(\mathbf{s}_i), & \; \text{otherwise,} \end{cases}\]
Covariance (Johnson et al. 2020): \[\mathrm{Cov}(Z(\mathbf{s}_i), Z(\mathbf{s}_j)) = {(\mathcal{A}(\mathbf{s}_i) \mathcal{A}(\mathbf{s}_j))}^{-1} \int_{\mathbf{s}_j} \int_{\mathbf{s}_i} C(\mathbf{x}, \mathbf{y}) \mathrm{d} \mathbf{x} \mathrm{d} \mathbf{y}\]
Arbitrary precision: The stochastic integrals are evaluated numerically and depend on a user-defined grid.
Proposition
Let \(h = h(\mathbf{s}, \mathbf{s}')\) be the Hausdorff distance between two spatial units, denoted \(\mathbf{s}, \mathbf{s}' \in \mathcal{B}(D)\), where \(D \subset \mathbb{R}^2\). The powered exponential correlation function \(\exp \{ - h^{\nu} / \phi^{\nu} \}\) is positive definite for \(\nu \in (1/2, 1)\).
The proposition above guarantees the validity of the HGP equipped with a PEC function with \(\nu \in (1/2, 1)\).
The proof is based on embedding the Hausdorff distance into a high-dimensional \(L_1\) normed Euclidean space, and using the fact that the exponential correlation function is positive definite on this space.
Goal: Evaluate HGP’s ability to adapt to different scenarios and compare its performance to specialized models.
200 simulated datasets for each combination of data generation process and scenario.
Estimation assessment: MAPE, RMSE, frequentist coverage of the credible interval (CP).
Predictions assessment: bias, RMSP, interval score (Gneiting and Raftery 2007, IS), frequentist coverate of the prediction intervals (CPP).
Data model: Poisson GLMM, single covariate, log link function, and a spatial random effect.
We fit Poisson GLMMs with a homoscedastic HGP prior on the random effect.
Assessment:
Data Generation:
Competing Method:
Generating data from HGP:
Regression parameter under misspecification:
HGP outperforms DAGAR when simulating data from a AGP random effect.
Data model: Gaussian GLMM, no covariates, identity link function, and a spatial random effect.
We fit the models with a heteroscedastic HGP as prior on the random effect.
Competing method: AGP1 and AGP2 priors on the random effect. The latter employes a finer grid/mesh resolution to approximate integrals numerically.
Parameters estimation when generating data from HGP random effects.
Model comparison: Predictions assessment.
The model employs a Matérn covariance function (at point level) with smoothness parameter fixed at 1. The other two parameters are the SD \(\sigma\) and the practical range \(\rho\). Both are analogous to the parameters with the same name in the HGP model.
Inferences are conducted using the INLA (Rue et al. 2009) and the stochastic partial differential equation (Lindgren et al. 2011, SPDE) method.
The HGP inherits same problems from geostatistical models.
There are no weakly consistent estimators for the variance and spatial dependence parameters (Zhang 2004)
Predictions are still optimal (Zhang 2004)
The small scale variation parameter had a low coverage in the credible intervals (\(79.5\%\) for scenario 2).
The MAPE for the \(\sigma\), \(\sigma_a\), and \(\rho\) parameters fluctuated around \(10\%\) in both scenarios.
HGP as an AGP Approximation: The HGP delivers satisfactory approximations to an AGP model.
HGP Advantages: The HGP offers potential computational advantages for large datasets while maintaining clear parameter interpretation.
Flexibility & Interpretability: Models diverse spatial data types and provides meaningful spatial parameters, regardless of unit size or shape.
Predictions: The HGP delivers accurate predictions
Areal Data: The model delivers valuable insights on the spatial dependence and accounts for sample units’ size, shape, and orientation.
Data Fusion: The HGP simplifies the modeling process eliminating the need of a user-defined grid or mesh. Consequently, avoids computationally intensive numerical integrations with arbitrary precision, making inference seamless.