flowchart LR A(Flexible index set) ==> B(GP) B ==> E(((HGP))) C(Hausdorff distance) ==> D(PEC) D ==> E
STATISTICAL INFERENCES AND PREDICTIONS FOR AREAL DATA AND SPATIAL DATA FUSION WITH HAUSDORFF-GAUSSIAN PROCESSES
Available at lcgodoy.me/slides/2025-ucsf/
EEB Department, UCSC
Department of Epidemiology & Biostatistics – UCSF
2025-03-12
Taking into account spatial dependence possibly present in data is a foremost aspect of spatial statistics.
General spatial model (Cressie 1993): \(\{ Z(\mathbf{s}) \; : \; \mathbf{s} \in D \}\), where \(D\) is an index set.
Statistical inference depends heavily on the spatial structure/geometry of the observed spatial data.
Geometry | Branch | Index set |
---|---|---|
Areas/polygons | Areal models | Countable |
Points | Geostatistics | Continuum |
Mixed | Spatial Data Fusion |
Areal data
Point-referenced data
Mixed/fused data
The main research questions we are interested in are:
Can we propose a model for spatial data that accomodates areal, point-referrenced, and fused data?
If so, is this model competitive when compared to specialized models?
Proposal: An isotropic GP defined on a flexible index set.
Main challenge: Defining a valid correlation function.
Point-referenced and areal spatial units are (closed and bounded) sets.
We need to generalize distance between points to distance between sets.
Ideally, this distance should:
Metric space: \((D, d)\), where \(D\) is a spatial region of interest.
Distance between a point and a set: \(d(x, A) = \inf_{a \in A} d(x, a)\), where \(d(x, y)\) is the distance between any two elements \(x, y \in D\)
Directed Hausdorff distance: \[{\vec h}(A, B) = \sup_{a \in A} d(a, B)\]
Hausdorff distance: \[h(A, B) = \max \left \{ \vec{h}(A, B), \vec{h}(B, A) \right \}\]
Study region: In spatial statistics, \(D\) is tipically a closed and bounded subset of \(\mathbb{R}^2\).
In this context, the Hausdorff distance is a metric (Appendix C, Molchanov 2005).
Index set: \(\mathcal{B}(D)\) represents the closed and bounded subsets of \(D \subset \mathbb{R}^2\).
\(Z(\mathbf{s}) \sim \mathrm{HGP}\{m(\mathbf{s}), v(\mathbf{s}), r(h)\}\), where \(\mathbf{s} \in \mathcal{B}(D)\).
Mean function: \(m(\mathbf{s}) = \mathbb{E}[Z(\mathbf{s})]\)
Covariance function: \(\mathrm{Cov}(Z(\mathbf{s}_1), Z(\mathbf{s}_2)) = v(\mathbf{s}_1) v(\mathbf{s}_2) r(h(\mathbf{s}_1, \mathbf{s}_2))\)
SD function: \(v(\mathbf{s}) = \sqrt{\mathrm{Var}(Z(\mathbf{s}))}\)
Correlation function: \(r(h) = \mathrm{Cor}(Z(\mathbf{s}_1), Z(\mathbf{s}_2)),\) where \(h\) denotes the Hausdorff distance between \(\mathbf{s}_1, \mathbf{s}_2 \in \mathcal{B}(D)\).
We may defined: \[v(\mathbf{s}) = \exp \{ \alpha_0 + \alpha_1 w(\mathbf{s}) \},\] where \(w(\mathbf{s})\) is a covariate available for any \(\mathbf{s} \in \mathcal{B}(D)\).
Useful special cases:
Homoscedastic: \(w(\mathbf{s}) = 0\) (consequence, \(\sigma = \exp \{ \alpha_0 \}\))
Data Fusion: \(w(\mathbf{s}) = \mathbb{1}(\mathcal{A}(\mathbf{s}) > 0)\).
Area dependent: \(w(\mathbf{s}) = \mathcal{A}(\mathbf{s})\)
Although flexible, one has to be careful when choosing this function to ensure the process validity (Palacios and Steel 2006).
For a valid process, its correlation function must satisfy the following properties:
Diminish with increasing distance: \(\lim_{h \to \infty}r(h) = 0\).
Bounded from above by 1: \(r(0) = 1\).
Positive definiteness (PD): yields positive definite correlation matrices for all its finite-dimensional marginal distributions.
Unfortunately, functions that are guaranteed to be positive definite on \((\mathbb{R}^2, \lVert \cdot \rVert_2)\) are not necessarily positive definite on other metric spaces (Li et al. 2023).
Definition: A real function \(k\) on a metric space \((D, d)\) is PD if it is continuous and satisfies:
\[ \sum_{i = 1}^{m} \sum_{j = 1}^{m} a_i a_j k(\mathbf{s}_i, \mathbf{s}_j) \geq 0 \]
for real numbers \(a_1, \ldots, a_m\) and \(\mathbf{s}_1, \ldots, \mathbf{s}_m \in D\).
Necessary condition: All the eigen values of the resulting correlation matrix must be non-negative.
Empirical assessment: We can check for PD by examining the smallest eigenvalue of the correlation matrix across the parameter space.
Parametrization: We reparametrize this function with \(\rho = {\log(10)}^{1 / \nu} \phi\).
Interpretation: \(\rho\) is the distance at which the spatial correlation reduces to \(0.10\).
Flexibility: A process that handles point-referrenced, areal, and mixed spatial data by construction.
Hausdorff distance: Enables HGP’s correlation function to account for the shape, size, and orientation of spatial objects.
PEC Function Validity: A valid HGP depends on selecting an appropriate PEC correlation function.
Empirical assessments: When analyzing a new dataset or trying a new correlation function, it is imperative to empirically assess the validity of the function.
flowchart LR A(Flexible index set) ==> B(GP) B ==> E(((HGP))) C(Hausdorff distance) ==> D(PEC) D ==> E
A generalized linear mixed effects model (GLMM) can be written as \[\begin{aligned} & Y(\mathbf{s}_i) \mid \mathbf{x}_i, Z(\mathbf{s}_i) \overset{{\rm ind}}{\sim} f(\cdot \mid \mu_i, \boldsymbol{\gamma}) \\ & g(\mu_i) = \mathbf{x}_i \boldsymbol{\beta} + Z(\mathbf{s}_i). \end{aligned}\]
Probability distribution: \(f(\cdot)\)
Conditional mean: \(\mu_i = \mathbb{E}[Y(\mathbf{s}_i) \mid \mathbf{x}_i, Z(\mathbf{s}_i)]\)
Link function: \(g(\cdot)\)
Model parameters: \(\boldsymbol{\theta} = {\{\boldsymbol{\beta}^\top, \boldsymbol{\sigma}^\top, \boldsymbol{\delta}^\top, \boldsymbol{\gamma}^\top \}}^\top\)
Joint density: \(p(\mathbf{y} \mid \mathbf{z}, \boldsymbol{\theta}) = \prod_{i = 1}^n f(y(\mathbf{s}_i) \mid \mu_i, \boldsymbol{\gamma})\)
Independent normal priors for the regression coefficients: \(\boldsymbol{\beta} \sim \mathcal{N}(\mathbf{0}, 10 \mathbf{I})\)
HGP prior for the latent random effects: \(\mathbf{Z} \sim \mathrm{HGP}\{0, v(\cdot), r(\cdot) \}\)
Exponential prior for the spatial dependence parameter: \(\rho \sim \mathrm{Exp}(a_\rho)\), where \(a_{\rho} = - \log(p_{\rho}) / \rho_0\).
Homoscedastic variance: \(v(\mathbf{s}) = \sqrt{\mathrm{Var}(Z(\mathbf{s}))} = \sigma \sim t_{+}(3)\)
Heteroscedastic HGP: \(\alpha \sim \mathcal{N}(\mathbf{0}, \mathbf{I})\), where \(v(\mathbf{s}) = \exp \{ \alpha_0 + \sum_i \alpha_i w_i(\mathbf{s}) \}\).
Posterior: \(\pi(\boldsymbol{\theta} \mid \mathbf{y}, \mathbf{z}) \propto p(\mathbf{y} \mid \mathbf{z}, \boldsymbol{\theta}) p(\mathbf{z} \mid \boldsymbol{\theta}) \pi(\boldsymbol{\theta})\)
MCMC sampler: No-U-Turn (Homan and Gelman 2014).
Convergence assessment: traceplots and split-\({\hat{R}}\) (Vehtari et al. 2021).
Goodness-of-fit criteria: LOOIC (lower values indicate better fit)
Posterior predictive distributions: \(p(\mathbf{y}^{\ast} \mid \mathbf{y})\)
Predictions assessment: Interval Score (IS) and RMSP (lower values indicate better fit)
HGP as a prior for the random effects disribution in a GLMM.
Bayesian inference through MCMC.
Uncertainty quantification of predictions through the posterior predictive distributions.
\[\begin{aligned} & (Y(\mathbf{s}_i) \mid x_i, z(\mathbf{s}_i)) \sim \text{Poisson}(E_i \mu_i) \\ & \log(\mu_i) = \beta_0 + x_i \beta_1 + z(\mathbf{s}_i) \end{aligned}\]
Sample units: 134 intermediate zones (IZ), where the \(i\)-th IZ is denoted \(\mathbf{s}_i\).
Number of hospitalizations: \(\mathbf{Y} = {(Y(\mathbf{s}_1), \ldots, Y(\mathbf{s}_n))}^\top\).
Expected number of hospitalizations based on the national age- and sex-standardized rates: \(E_i\).
Percentage of people classified as income deprived: \(x_i\).
HGP | DAGAR | |
---|---|---|
\(\beta_0\) | -0.21 (-0.268, -0.139) | -0.26 (-0.450, -0.137) |
\(\beta_1\) | 0.33 (0.284, 0.368) | 0.31 (0.258, 0.370) |
\(\sigma\) | 0.19 (0.155, 0.234) | 0.30 (0.218, 0.484) |
\(\rho\) | 2.25 (0.159, 6.948) | |
\(\psi\) | 0.43 (0.069, 0.827) | |
LOOIC | 1081.0 | 1081.9 |