On spatial statistics and sets
Department of Statistics – PUC Chile
2025-08-18
Taking into account spatial dependence possibly present in data is a foremost aspect of spatial statistics.
General spatial model (Cressie 1993): \(\{ Z(\mathbf{s}) \; : \; \mathbf{s} \in D \}\), where \(D\) is an index set.
The model of choice depends heavily on the spatial structure/geometry of the observed spatial data.
Geometry | Branch | Index set |
---|---|---|
Areas/polygons | Areal models | Countable |
Points | Geostatistics | Continuum |
Mixed | Spatial Data Fusion |
Areal data
Point-referenced data
Mixed/fused data
The main research questions we are interested in are:
Can we propose a model for spatial data that accomodates areal, point-referrenced, and fused data?
If so, is this model competitive when compared to specialized models?
Proposal: An isotropic GP defined on a flexible index set.
Main challenge: Defining a valid correlation function.
Gaussian process: A stochastic process such that all its finite-dimensional marginal distributions are Gaussian.
Completely characterized (probabilistically) by two functions:
A GP is isotropic if \(\mathrm{Cov}(Z(\mathbf{s}_1), Z(\mathbf{s}_2)) = \mathrm{C(d(\mathbf{s}_1, \mathbf{s}_2))}\), where \(d\) is a distance function.
Point-referenced and areal spatial units are (closed and bounded) sets.
We need to generalize distance between points to distance between sets.
Ideally, this distance should:
Metric space: \((D, d)\), where \(D\) is a spatial region of interest.
Distance between a point and a set: \(d(x, A) = \inf_{a \in A} d(x, a)\), where \(d(x, y)\) is the distance between any two elements \(x, y \in D\)
Directed Hausdorff distance: \[{\vec h}(A, B) = \sup_{a \in A} d(a, B)\]
Hausdorff distance: \[h(A, B) = \max \left \{ \vec{h}(A, B), \vec{h}(B, A) \right \}\]
Study region: In spatial statistics, \(D\) is tipically a non-empty, closed and bounded subset of \(\mathbb{R}^2\).
In this context, the Hausdorff distance is a metric (Appendix C, Molchanov 2005).
Index set: \(\mathcal{B}(D)\) represents the non-empty, closed and bounded subsets of \(D \subset \mathbb{R}^2\).
\(Z(\mathbf{s}) \sim \mathrm{HGP}\{m(\mathbf{s}), v(\mathbf{s}), r(h)\}\), where \(\mathbf{s} \in \mathcal{B}(D)\).
Mean function: \(m(\mathbf{s}) = \mathbb{E}[Z(\mathbf{s})]\)
SD function: \(v(\mathbf{s}) = \sqrt{\mathrm{Var}(Z(\mathbf{s}))}\)
Cov. function: \(\mathrm{Cov}(Z(\mathbf{s}_1), Z(\mathbf{s}_2)) = v(\mathbf{s}_1) v(\mathbf{s}_2) r(h(\mathbf{s}_1, \mathbf{s}_2))\)
Correlation function: \(r(h) = \mathrm{Cor}(Z(\mathbf{s}_1), Z(\mathbf{s}_2)),\) where \(h\) denotes the Hausdorff distance between \(\mathbf{s}_1, \mathbf{s}_2 \in \mathcal{B}(D)\).
We may defined: \[v(\mathbf{s}) = \exp \{ \alpha_0 + \alpha_1 w(\mathbf{s}) \},\] where \(w(\mathbf{s})\) is a covariate available for any \(\mathbf{s} \in \mathcal{B}(D)\).
Useful special cases:
Homoscedastic: \(w(\mathbf{s}) = 0\) (\(\therefore \sigma = \exp \{ \alpha_0 \}\))
Data Fusion: \(w(\mathbf{s}) = \mathbb{1}(\mathcal{A}(\mathbf{s}) > 0)\).
Area dependent: \(w(\mathbf{s}) = \mathcal{A}(\mathbf{s})\)
Although flexible, one has to be careful when choosing this function to ensure the process validity (Palacios and Steel 2006).
Parametrization: \(\rho = {\log(10)}^{1 / \nu} \phi\).
Interpretation: \(\rho\) is the distance at which the spatial correlation reduces to \(0.10\).
For a valid process, its correlation function must satisfy the following properties:
Diminish with increasing distance: \(\lim_{h \to \infty}r(h) = 0\).
Bounded from above by 1: \(r(0) = 1\).
Positive definiteness (PD): yields positive definite correlation matrices for all its finite-dimensional marginal distributions.
Unfortunately, functions that are guaranteed to be positive definite on \((\mathbb{R}^2, \lVert \cdot \rVert_2)\) are not necessarily positive definite on other metric spaces (Li et al. 2023).
Definition: A real function \(k\) on a metric space \((D, d)\) is PD if it is continuous and satisfies:
\[ \sum_{i = 1}^{m} \sum_{j = 1}^{m} a_i a_j k(\mathbf{s}_i, \mathbf{s}_j) \geq 0 \]
for real numbers \(a_1, \ldots, a_m\) and \(\mathbf{s}_1, \ldots, \mathbf{s}_m \in D\).
Necessary condition: All the eigenvalues of the resulting correlation matrix must be non-negative.
Empirical assessment: We can check for PD by examining the smallest eigenvalue of the correlation matrix across the parameter space.
Flexibility: A process that handles point-referrenced, areal, and mixed spatial data by construction.
Hausdorff distance: Enables HGP’s correlation function to account for the shape, size, and orientation of spatial objects.
PEC Function Validity: A valid HGP depends on selecting an appropriate PEC correlation function.
Empirical assessments: When analyzing a new dataset or trying a new correlation function, it is imperative to empirically assess the validity of the function.
A generalized linear mixed effects model (GLMM) can be written as \[\begin{aligned} & Y(\mathbf{s}_i) \mid \mathbf{x}_i, Z(\mathbf{s}_i) \overset{{\rm ind}}{\sim} f(\cdot \mid \mu_i, \boldsymbol{\gamma}) \\ & g(\mu_i) = \mathbf{x}_i \boldsymbol{\beta} + Z(\mathbf{s}_i). \end{aligned}\]
Probability distribution: \(f(\cdot)\)
Conditional mean: \(\mu_i = \mathbb{E}[Y(\mathbf{s}_i) \mid \mathbf{x}_i, Z(\mathbf{s}_i)]\)
Link function: \(g(\cdot)\)
Model parameters: \(\boldsymbol{\theta} = {\{\boldsymbol{\beta}^\top, \boldsymbol{\sigma}^\top, \boldsymbol{\delta}^\top, \boldsymbol{\gamma}^\top \}}^\top\)
Joint density: \(p(\mathbf{y} \mid \mathbf{z}, \boldsymbol{\theta}) = \prod_{i = 1}^n f(y(\mathbf{s}_i) \mid \mu_i, \boldsymbol{\gamma})\)
Posterior: \(\pi(\boldsymbol{\theta} \mid \mathbf{y}, \mathbf{z}) \propto p(\mathbf{y} \mid \mathbf{z}, \boldsymbol{\theta}) p(\mathbf{z} \mid \boldsymbol{\theta}) \pi(\boldsymbol{\theta})\)
MCMC sampler: No-U-Turn (Homan and Gelman 2014).
Convergence assessment: traceplots and split-\({\hat{R}}\) (Vehtari et al. 2021).
Goodness-of-fit criteria: LOOIC (lower values indicate better fit)
Posterior predictive distributions: \(p(\mathbf{y}^{\ast} \mid \mathbf{y})\)
Predictions assessment: Interval Score (IS) and RMSP (lower values indicate better fit)
HGP as a prior for the random effects disribution in a GLMM.
Bayesian inference through MCMC.
Uncertainty quantification of predictions through the posterior predictive distributions.
\[\begin{aligned} & (Y(\mathbf{s}_i) \mid x_i, z(\mathbf{s}_i)) \sim \text{Poisson}(E_i \mu_i) \\ & \log(\mu_i) = \beta_0 + x_i \beta_1 + z(\mathbf{s}_i) \end{aligned}\]
Sample units: 134 intermediate zones (IZ), where the \(i\)-th IZ is denoted \(\mathbf{s}_i\).
Number of hospitalizations: \(\mathbf{Y} = {(Y(\mathbf{s}_1), \ldots, Y(\mathbf{s}_n))}^\top\).
Expected number of hospitalizations based on the national age- and sex-standardized rates: \(E_i\).
Percentage of people classified as income deprived: \(x_i\).
HGP | DAGAR | |
---|---|---|
\(\beta_0\) | -0.21 (-0.268, -0.139) | -0.26 (-0.450, -0.137) |
\(\beta_1\) | 0.33 (0.284, 0.368) | 0.31 (0.258, 0.370) |
\(\sigma\) | 0.19 (0.155, 0.234) | 0.30 (0.218, 0.484) |
\(\rho\) | 2.25 (0.159, 6.948) | |
\(\psi\) | 0.43 (0.069, 0.827) | |
LOOIC | 1081.0 | 1081.9 |
\[(Y(\mathbf{s}_i) \mid z(\mathbf{s}_i)) \sim \mathcal{N}(\beta_0 + z(\mathbf{s}_i), \tau^2)\]
PM2.5: \(\mathbf{Y} = {(Y(\mathbf{s}_1), \ldots, Y(\mathbf{s}_n))}^\top\).
Point-referrenced data from 19 measurement stations available daily from 1999 to date;
Satellite-derived estimates (2010–2012) at 184 areal units.
HGP | \(\rm AGP_1\) | \(\rm AGP_2\) | |
---|---|---|---|
\(\beta\) | 5.61 (4.69, 6.45) | 6.22 (2.16, 10.10) | 6.19 (5.88, 6.48) |
\(\rho\) | 13.83 (7.82, 23.61) | 13.16 (5.14, 30.24) | 0.63 (0.46, 0.83) |
\(\tau\) | 0.18 (0.07, 0.31) | 1.39 (1.24, 1.56) | 0.54 (0.39, 0.72) |
\(\sigma\) | 3.85 (2.92, 5.18) | 1.70 (0.96, 2.91) | 2.41 (2.024, 2.84) |
\(\sigma_a\) | 1.24 (1.04, 1.51) | ||
RMSP | 1.05 | 1.45 | 1.64 |
Width | 3.57 | 2.53 | 9.67 |
CPP | 95.5 | 78.6 | 95.5 |
IS | 4.80 | 15.11 | 13.65 |
The proposed method consistently demonstrated performance comparable to specialized models tailored for areal and fused data.
Unlike traditional areal models, the HGP’s marginal variances are independent of the number of neighbors.
The HGP model simplifies data fusion by bypassing the need to define arbitrary grids for numerical integral evaluation, eliminating this step each time the joint probability distribution of the data and parameters is calculated.
Across both applications, the HGP provides an interpretable spatial dependence parameter and a spatial correlation function.
The HGP has proven to be a powerful model that offers:
Versatility: accomodates diverse spatial data types.
Performance: competitive against models designed for specific spatial data types.
Reliable predictions: prediction intervals with near nominal frequentist coverage.
Our conclusions are further supported by a comprehensive simulation study, detailed in our available preprint.
Radially positive-definite functions of the Hausdorff distance on non-Euclidean spaces(with Marcos Prates, Fernando Quintana and Jun Yan)
HGP, big data and INLA (with Marcos Prates and Jun Yan)
Nearest-neighbor mixture processes and the Hausdorff distance (with Bruno Sansó)
Neural-networks and spatiotemporal data fusion (with Lucas Michelin, Marcos Prates, and Abhi Datta)
Multiplier bootstrap and fast inference for auto-logistic models (with Yuan Huang and Jun Yan)
Review paper on sports analytics (with Greg J. Matthews and Jun Yan)
Combining Bayesian hierarchical models & population dynamics for species distribution models (with Malin Pinsky)
R
package to estimate risks in consumer automobile loans (with Jackson Lautier)
Non-stationary geostatistical models for stock assessment in fisheries (with James Thorson)
Non-linear mixed effect models to analyze plants heat-tolerance changes in face of climate change (with Georgia Hernández)
Organization of weekly meetings and extracurricular projects for students.
Co-advised a Master’s student in Brazil (Michelin et al. 2025)
Many papers as a result of my work as statistical consultant. Highlights: Andreyeva et al. (2024) and Sieger et al. (2025)
Independent normal priors for the regression coefficients: \(\boldsymbol{\beta} \sim \mathcal{N}(\mathbf{0}, 10 \mathbf{I})\)
HGP prior for the latent random effects: \(\mathbf{Z} \sim \mathrm{HGP}\{0, v(\cdot), r(\cdot) \}\)
Exponential prior for the spatial dependence parameter: \(\rho \sim \mathrm{Exp}(a_\rho)\), where \(a_{\rho} = - \log(p_{\rho}) / \rho_0\).
Homoscedastic variance: \(v(\mathbf{s}) = \sqrt{\mathrm{Var}(Z(\mathbf{s}))} = \sigma \sim t_{+}(3)\)
Heteroscedastic HGP: \(\alpha \sim \mathcal{N}(\mathbf{0}, \mathbf{I})\), where \(v(\mathbf{s}) = \exp \{ \alpha_0 + \sum_i \alpha_i w_i(\mathbf{s}) \}\).