Acknowledgements

This research follows from my PhD thesis and is conducted in collaboration with:

  • Marcos O. Prates - Universidade Federal de Minas Gerais

  • Jun Yan - University of Connecticut

  • Fernando A. Quintana - Pontificia Universidad Católica de Chile

  • Bruno Sansó - University of California Santa Cruz

Introduction

RF in Spatial Statistics

  • A random field (RF): \(\{ Z(\mathbf{s}) \; : \; \mathbf{s} \in D \}\), where \(D\) is an index set.

  • RFs are used extensively in spatial statistics (Cressie 1993), where sample units are conceptualized as elements of an index set \(D\).

  • Inference based on RF relies on further assumptions. The usual assumptions depend heavily on the spatial structure/geometry of the observed spatial data.

Geometry       | Branch        | Index set
Areas/polygons | Areal models  | Countable
Points         | Geostatistics | Continuum

Spatial sample units in practice

Multiple Spatial Resolutions

  • Change of Support: Predicting a process on one spatial resolution (or scale) using data collected from a different resolution (Gelfand et al. 2001);

    • Downscaling: A specific type of change of support where coarse aggregate data is used to infer values at a finer resolution (Zheng et al. 2025).
  • Spatial Data Fusion: Analyzing the same phenomenon when observations are simultaneously available at multiple resolutions (Moraga et al. 2017).

  • Spatial Misalignment: Handling response and explanatory variables that are observed on different spatial resolutions (Godoy et al. 2026a).

Set-indexed Random Fields

  • A set-indexed random field \(Z(\mathbf{s})\) should be well defined when \(\mathbf{s} \in D\) is not a singleton.
    • \(\{Z(\mathbf{s}) \, : \, \mathbf{s} \subset D\}\)
  • Random fields indexed by sets are not a completely new idea:
    • Itô calculus, Brownian sheet (Adler and Taylor 2007, Ch 1.4)
    • Stochastic integrals: \(Z(B) = {\lvert B \rvert}^{-1} \int_B Z(\mathbf{x}) \mathrm{d}\mathbf{x}\)

Typical Assumptions

  • The assumptions regarding the index set \(D\) are inherited from Geostatistics.

  • Realizations observed over areal units (or blocks) are an aggregation of the point-level process \(Z(\mathbf{s})\): \[ Z(B) = {\lvert B \rvert}^{-1} \int_{B} Z(\mathbf{x}) \mathrm{d}\mathbf{x}, \]

  • Covariances involving aggregations are as follows: \[ \mathrm{Cov}[Z(B), Z(\mathbf{s})] = {\lvert B \rvert}^{-1} \int_{B} \mathrm{Cov}[Z(\mathbf{x}), Z(\mathbf{s})] \mathrm{d}\mathbf{x} \]

Limitations

  • Since analytical solutions are generally unavailable, Monte Carlo techniques are used to approximate the covariances (Gelfand et al. 2001): \[ \begin{align} \mathrm{Cov}[Z(B), Z(\mathbf{s})] & = {\lvert B \rvert}^{-1} \int_{B} \mathrm{Cov}[Z(\mathbf{x}), Z(\mathbf{s})] \mathrm{d}\mathbf{x} \\ & \approx L^{-1} \sum_{k = 1}^{L} \mathrm{Cov}[Z(\mathbf{s}_k), Z(\mathbf{s})], \end{align} \] where \(\mathbf{s}_1, \ldots, \mathbf{s}_L\) are points sampled within \(B\).

  • There is no consensus in the literature about how to choose \(L\) and these approximations may introduce unquantifiable biases (Gonçalves and Gamerman 2018).
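The Monte Carlo approximation above can be sketched as follows. This is a minimal illustration, assuming a rectangular block, uniform sampling inside it, and an exponential point-level covariance; the names `exp_cov` and `mc_block_point_cov` are hypothetical and not taken from the cited papers.

```python
import numpy as np

def exp_cov(h, sigma2=1.0, phi=1.0):
    """Exponential covariance, used only as an illustrative point-level model."""
    return sigma2 * np.exp(-h / phi)

def mc_block_point_cov(block, s, n_draws, rng):
    """Approximate Cov[Z(B), Z(s)] for B = [xmin, xmax] x [ymin, ymax]."""
    xmin, xmax, ymin, ymax = block
    # Draw L = n_draws locations uniformly inside the block (an assumption).
    pts = np.column_stack([
        rng.uniform(xmin, xmax, n_draws),
        rng.uniform(ymin, ymax, n_draws),
    ])
    # Distances between the sampled locations and the point s.
    h = np.linalg.norm(pts - np.asarray(s, dtype=float), axis=1)
    # Average of the point-level covariances: L^{-1} * sum_k Cov[Z(s_k), Z(s)].
    return exp_cov(h).mean()

rng = np.random.default_rng(1)
block = (0.0, 1.0, 0.0, 1.0)   # unit square
site = (2.0, 0.5)              # point-referenced location outside the block
estimates = {L: mc_block_point_cov(block, site, L, rng) for L in (10, 100, 10_000)}
```

Comparing the entries of `estimates` across values of `L` gives a feel for how much the approximation moves with the number of draws, which is the tuning choice discussed above.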

An Alternative

  • A more flexible index set: The class of non-empty, closed and bounded sets in \(D\) (denoted \(\mathcal{C}_D\)).

    • \(\{Z(\mathbf{s}) \, : \, \mathbf{s} \in \mathcal{C}_D\}\) (Godoy et al. 2026b).
    • Isotropic Gaussian processes (GPs) are defined with covariance functions that depend on a distance between sets.
  • Successful in practice: Competitive with areal models, often better than models for data fusion.

  • Lacking theoretical foundation: there is no formal proof that the covariance functions are valid, and smooth covariance functions are not allowed.

Research objectives

  • Derive a theoretically sound RF for spatial data, which allows for modeling areal, point-referenced, and mixed spatial data seamlessly.

  • To achieve our goal, we will:

    • Define a new “appropriate” distance for sets (i.e., elements of \(\mathcal{C}_D\)).
    • Review properties of covariance functions.
    • Prove that covariance functions equipped with the proposed distance yield valid RFs.

A new distance function for sets

The Hausdorff distance (HD)

  • Definition \[ h(A_1, A_2) = \inf \{ r \geq 0 \, : \, A_1 \subseteq {\rm B}_r(A_2), A_2 \subseteq {\rm B}_r(A_1) \}, \] where \(A_1 \subset D\) and \(A_2 \subset D\) are two non-empty sets.

  • Intuition: given a reference metric space \((D, d)\), the Hausdorff distance quantifies the greatest distance one would have to travel from a point in one set to reach the other set.

  • Limitations: Expensive to compute (Knauer et al. 2011), and there are no results establishing positive-definite functions of this distance.
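For finite point sets, the definition above reduces to the larger of the two directed distances (the farthest any point of one set lies from the other set). The sketch below assumes Euclidean distance and sets stored as coordinate arrays; the function name `hausdorff` is hypothetical.

```python
import numpy as np

def hausdorff(a, b):
    """Hausdorff distance between two finite point sets (rows are points)."""
    a = np.atleast_2d(np.asarray(a, dtype=float))
    b = np.atleast_2d(np.asarray(b, dtype=float))
    # Pairwise Euclidean distances, shape (len(a), len(b)).
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    # Directed distances: sup over one set of the distance to the other set.
    a_to_b = d.min(axis=1).max()
    b_to_a = d.min(axis=0).max()
    return max(a_to_b, b_to_a)

set_a = [(0.0, 0.0), (1.0, 0.0)]
set_b = [(0.0, 0.0), (4.0, 0.0)]
h_ab = hausdorff(set_a, set_b)
```

This version needs all pairwise distances, so its cost grows with the product of the two set sizes.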

HD, Balls & Length spaces

  • Definition: A metric space \((D, d)\) is a length space if \(d(x, y)\) equals the infimum of the lengths of paths connecting \(x \in D\) and \(y \in D\) (Burago et al. 2001).

  • Property: Length spaces possess approximate midpoints.

  • Lemma: Let \((D, d)\) be a length space. Then balls expand linearly: \[ {\rm B}_{r}({\rm B}_{k}(x)) = {\rm B}_{r + k}(x). \]

  • Consequence: \(h({\rm B}_r(x), {\rm B}_k(y)) = d(x, y) + \lvert r - k \rvert\).
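A worked example of the consequence above, assuming closed balls on the real line with \(d(x, y) = \lvert x - y \rvert\). Take \({\rm B}_1(0) = [-1, 1]\) and \({\rm B}_3(5) = [2, 8]\). The \(r\)-neighborhoods are \({\rm B}_r([2, 8]) = [2 - r, 8 + r]\) and \({\rm B}_r([-1, 1]) = [-1 - r, 1 + r]\), so \[ [-1, 1] \subseteq [2 - r, 8 + r] \iff r \geq 3, \qquad [2, 8] \subseteq [-1 - r, 1 + r] \iff r \geq 7. \] Hence \[ h({\rm B}_1(0), {\rm B}_3(5)) = \max\{3, 7\} = 7 = \lvert 0 - 5 \rvert + \lvert 1 - 3 \rvert. \]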

Minimum Enclosing Balls

  • We denote the smallest ball containing a set \(A \in \mathcal{C}_D\) by \(\mathcal{B}(A)\).

    • Radius: \({\rm R}(A) = \inf_{x \in D} {\rm R}(x, A)\), where \({\rm R}(x, A) = \inf \{ r \geq 0 : A \subset {\rm B}_r(x) \}\)

    • Set of centers: \(\mathcal{E}(A) = \{ x \in D : {\rm R}(x, A) = {\rm R}(A) \}.\)

    • Chebyshev center: \(c(A) \in \mathcal{E}(A)\)

  • \(\mathcal{B}(A)\) always exists for closed and bounded sets on normed metric spaces (Garkavi 1970) and on complete manifolds, such as the sphere and the torus (Burago et al. 2001).

The ball-Hausdorff distance

  • Definition: Let \((D, d)\) be a length space. Define \(\mathcal{C}_D\) as the class of non-empty, closed and bounded sets in \(D\). The ball-Hausdorff distance is defined as: \[ bh(A_1, A_2) = d(c(A_1), c(A_2)) + \lvert R(A_1) - R(A_2) \rvert, \] where \(A_1, A_2 \in \mathcal{C}_D\).

  • Applied context: The class \(\mathcal{C}_D\) encompasses most types of data we encounter in Spatial Statistics:

    • Areal units and point-referenced locations are closed and bounded sets (both on \(\mathbb{R}^n\) and \(\mathbb{S}^{n - 1}\))
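A minimal sketch of the definition above, assuming planar Euclidean distance and sets given as finite point collections (for a polygon, its vertices suffice because a ball is convex). The brute-force search over pairs and triples is an illustrative stand-in, not the algorithm proposed in this work, and the names `min_enclosing_ball` and `ball_hausdorff` are hypothetical.

```python
import itertools
import math

def _circle_from_two(p, q):
    """Circle having the segment pq as a diameter."""
    center = ((p[0] + q[0]) / 2.0, (p[1] + q[1]) / 2.0)
    return center, math.dist(p, q) / 2.0

def _circle_from_three(p, q, r):
    """Circumcircle of three points; None if they are (nearly) collinear."""
    (ax, ay), (bx, by), (cx, cy) = p, q, r
    det = 2.0 * (ax * (by - cy) + bx * (cy - ay) + cx * (ay - by))
    if abs(det) < 1e-12:
        return None
    a2, b2, c2 = ax**2 + ay**2, bx**2 + by**2, cx**2 + cy**2
    ux = (a2 * (by - cy) + b2 * (cy - ay) + c2 * (ay - by)) / det
    uy = (a2 * (cx - bx) + b2 * (ax - cx) + c2 * (bx - ax)) / det
    return (ux, uy), math.dist((ux, uy), p)

def min_enclosing_ball(points, tol=1e-9):
    """Return (Chebyshev center, radius) of a finite planar point set."""
    pts = [(float(x), float(y)) for x, y in points]
    if len(pts) == 1:
        return pts[0], 0.0
    # The smallest enclosing circle is determined by two or three points.
    candidates = [_circle_from_two(p, q) for p, q in itertools.combinations(pts, 2)]
    for triple in itertools.combinations(pts, 3):
        circle = _circle_from_three(*triple)
        if circle is not None:
            candidates.append(circle)
    best = None
    for center, radius in candidates:
        covers_all = all(math.dist(center, p) <= radius + tol for p in pts)
        if covers_all and (best is None or radius < best[1]):
            best = (center, radius)
    return best

def ball_hausdorff(a, b):
    """bh(A1, A2) = d(c(A1), c(A2)) + |R(A1) - R(A2)|."""
    (ca, ra), (cb, rb) = min_enclosing_ball(a), min_enclosing_ball(b)
    return math.dist(ca, cb) + abs(ra - rb)

square = [(0, 0), (1, 0), (1, 1), (0, 1)]   # vertices of an areal unit
point = [(3.0, 0.5)]                        # a point-referenced location
bh_mixed = ball_hausdorff(square, point)
```

A point is treated as a ball of radius zero, so areal and point-referenced units go through the same function. The search is quartic in the number of points and is only meant for small inputs.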

Additional results

  • Remark: On the real line, the ball-Hausdorff distance is equivalent to the Hausdorff distance.

  • Remark: If we use the \(\lVert \cdot \rVert_1\) distance for sets in \(\mathbb{R}^p\), the ball-Hausdorff distance can be isometrically embedded into \((\mathbb{R}^{p + 1}, \lVert \cdot \rVert_1)\).

  • Theorem: Let \((D, d)\) be a pseudometric space. An upper-bound for the ball-Hausdorff distance is given by: \[ bh(A_1, A_2) \leq d(c(A_1), c(A_2)) + \max \{R(A_1), R(A_2)\}, \] where \(A_1, A_2 \in \mathcal{C}_D\).
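The embedding remark above can be checked directly with one candidate map, stated here as an assumption rather than as the construction used in the original proof. Let \(\varphi(A) = (c(A), R(A)) \in \mathbb{R}^{p + 1}\), with \(d(x, y) = \lVert x - y \rVert_1\). Then \[ \lVert \varphi(A_1) - \varphi(A_2) \rVert_1 = \lVert c(A_1) - c(A_2) \rVert_1 + \lvert R(A_1) - R(A_2) \rvert = bh(A_1, A_2), \] so \(\varphi\) preserves the ball-Hausdorff distance.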

Comparison with Hausdorff distance

Time Comparisons

  • Between 5 and 186 times faster than the Hausdorff distance in the scenarios examined here.

Valid isotropic covariance functions for sets

Covariance Functions

  • Covariance function (CF): \(K : D \times D \to \mathbb{R}\)

  • Isotropic CFs: \(K \{ d(\mathbf{s}_1, \mathbf{s}_2) \}\).

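  • Positive definiteness (PD): The CF of a RF must satisfy \(\sum_{i, j = 1}^{n} c_i c_j K\{d(\mathbf{s}_i, \mathbf{s}_j)\} \geq 0\) for every \(n\), all \(c_1, \ldots, c_n \in \mathbb{R}\), and all \(\mathbf{s}_1, \ldots, \mathbf{s}_n \in D\).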

  • Isotropic CF of the Euclidean distance: \(\Phi_p\) is the class of valid isotropic CF on \(\mathbb{R}^p\): \(\Phi_1 \supset \Phi_2 \supset \cdots \supset \Phi_{\infty} = \bigcap_{k = 1}^{\infty} \Phi_{k}\)

  • Notable members of \(\Phi_{\infty}\): Matérn and Powered Exponential.

CND functions & Embeddings

  • Let \((D, d)\) and \((D^\ast, d^\ast)\) denote metric spaces.

  • Isometric embedding: \(\phi \,: \, D \to D^\ast\) such that \(d^\ast(\phi(s), \phi(t)) = d(s, t)\), for any \(s, t \in D\) (Wells and Williams 1975).

  • Conditionally negative definite (CND) function: \(g \, : \, D \times D \to \mathbb{R}_{+}\) satisfying \[\sum_{i = 1}^{m} \sum_{j = 1}^{m} b_i b_j g(s_i, s_j) \leq 0\] for any \(s_1, \ldots, s_m \in D\) and any \(b_1, \ldots, b_m \in \mathbb{R}\) with \(\sum_{i = 1}^{m} b_i = 0\).

Schoenberg’s Theorem

  • Theorem: A pseudometric space \((D, d)\) can be isometrically embedded in a Hilbert space if and only if \(d^2\) is CND.

  • Consequence: Let \((D, d)\) be a pseudometric space such that \(d\) is CND. Then, any function belonging to the class \(\Phi_{\infty}\) is PD on \((D, d^{1/2})\).
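The CND condition can be probed numerically. The sketch below uses the squared Euclidean distance, which is CND (as Schoenberg’s Theorem implies for any distance inherited from a Hilbert space), and evaluates the quadratic form for random weights that sum to zero. The helper name `cnd_quadratic_forms` and the simulated points are assumptions made for illustration; random trials can refute the CND property but cannot prove it.

```python
import numpy as np

def cnd_quadratic_forms(g, n_trials, rng):
    """Evaluate sum_ij b_i b_j g_ij for random weight vectors with sum(b) = 0."""
    m = g.shape[0]
    b = rng.normal(size=(n_trials, m))
    b -= b.mean(axis=1, keepdims=True)   # enforce the constraint sum(b) = 0
    return np.einsum("ti,ij,tj->t", b, g, b)

rng = np.random.default_rng(0)
pts = rng.normal(size=(30, 3))                       # hypothetical locations in R^3
diff = pts[:, None, :] - pts[None, :, :]
sq_dist = (diff ** 2).sum(axis=-1)                   # squared Euclidean distances

forms = cnd_quadratic_forms(sq_dist, 1000, rng)
largest_form = forms.max()                           # should not be positive
```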

Theorem: ball-Hausdorff distance properties

Let \((D, d)\) be a length space where the function \(d\) is CND. Define \(\mathcal{C}_D\) as the class of non-empty, bounded sets in \(D\). Then,

  1. \(bh(\cdot, \cdot)\) is CND,
  2. \((\mathcal{C}_D, \sqrt{bh})\) can be embedded in a Hilbert space.

Intuition

  1. Follows from the fact that CND functions form a convex cone.

  2. Follows from Schoenberg’s Theorem.

Corollary: PEXP covariance function

Let \((D, d)\) be a length space where the function \(d\) is CND. Then, the Powered Exponential (PEXP) covariance function \[ K(h; \, \theta) = \sigma^2 \exp \left\{ - \left( \frac{h}{\phi} \right)^{\nu} \right\} \] is a valid family on \((\mathcal{C}_D, bh)\) for \(\nu \in (0, 1]\).

Corollary: Matérn covariance function

Let \((D, d)\) be a length space where the function \(d\) is CND. Then, the Matérn covariance function \[ K(h; \, \theta) = \sigma^2 \frac{1}{2^{\nu - 1}\Gamma(\nu)} {\left(\frac{h}{\phi}\right)}^{\nu} K_{\nu} \left( \frac{h}{\phi} \right) \] is a valid family on \((\mathcal{C}_D, \sqrt{bh})\), where \(K_{\nu}(\cdot)\) denotes the modified Bessel function of the second kind.
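The two corollaries can be sketched in code as follows. This assumes a precomputed matrix of ball-Hausdorff distances (the values below are hypothetical) and uses the closed form that the Matérn expression above takes at \(\nu = 5/2\), namely \(\sigma^2 (1 + x + x^2 / 3) e^{-x}\) with \(x = h / \phi\). The function names are illustrative.

```python
import numpy as np

def pexp_cov(h, sigma2=1.0, phi=1.0, nu=1.0):
    """Powered exponential covariance evaluated at distances h."""
    h = np.asarray(h, dtype=float)
    return sigma2 * np.exp(-((h / phi) ** nu))

def matern52_cov(h, sigma2=1.0, phi=1.0):
    """Matern covariance with nu = 5/2, in the parameterization used above."""
    x = np.asarray(h, dtype=float) / phi
    return sigma2 * (1.0 + x + x**2 / 3.0) * np.exp(-x)

# Hypothetical ball-Hausdorff distances among three spatial units.
bh = np.array([
    [0.0, 1.5, 3.0],
    [1.5, 0.0, 2.0],
    [3.0, 2.0, 0.0],
])

# PEXP is applied to bh itself, with nu in (0, 1].
K_pexp = pexp_cov(bh, sigma2=2.0, phi=1.0, nu=1.0)
# Matern is applied to the element-wise square root of bh.
K_matern = matern52_cov(np.sqrt(bh), sigma2=2.0, phi=1.0)
```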

Recap

  • The ball-Hausdorff distance is based on the Hausdorff distance between minimum enclosing balls.

  • Its existence is guaranteed for most scenarios that are relevant in spatial statistics applications.

  • Conditions for the CND property of the distance and an algorithm for its computation have been proposed.

  • Rich families of covariance functions are readily available for the (element-wise) square-root of the ball-Hausdorff distance.

Do the same properties hold for the Hausdorff distance?

  • One way to assess whether a covariance function fails to be PD is as follows:
    1. Compute the distance matrix for a real dataset (based on the distance function you are interested in)
    2. Compute the covariance matrices implied by the covariance functions and that distance matrix
    3. Compute the smallest eigenvalue (denoted \(\lambda_1\)) associated with those covariance matrices
    4. If \(\lambda_1 < 0\), we have strong evidence that the covariance function is not valid.
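The steps above can be sketched as follows. To keep the example small, it uses three sites on the real line and contrasts the exponential covariance with a deliberately invalid "boxcar" function (one inside a cutoff, zero outside); both choices and the helper name `smallest_eigenvalue` are assumptions made for illustration, not the assessment reported in these slides.

```python
import numpy as np

def smallest_eigenvalue(dist, cov_fn):
    """Smallest eigenvalue of the covariance matrix implied by cov_fn and dist."""
    k = cov_fn(np.asarray(dist, dtype=float))
    # eigvalsh returns the eigenvalues of a symmetric matrix in ascending order.
    return np.linalg.eigvalsh(k)[0]

# Step 1: distance matrix (three sites at 0, 1, and 2 on the real line).
sites = np.array([0.0, 1.0, 2.0])
dist = np.abs(sites[:, None] - sites[None, :])

# Steps 2-3: covariance matrices and their smallest eigenvalues.
lam_exp = smallest_eigenvalue(dist, lambda h: np.exp(-h))
lam_box = smallest_eigenvalue(dist, lambda h: (h < 1.5).astype(float))

# Step 4: a negative value flags a function that is not PD.
flagged = {"exponential": lam_exp < 0, "boxcar": lam_box < 0}
```

A non-negative smallest eigenvalue on one dataset does not prove validity; the check is only informative when it fails.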

Assessment: Matérn (\(\nu = 2.5\))

Assessment: PEXP (\(\nu = 1\))

Atmospheric temperature application

The data

Marginal ECDFs

Model

  • Model: \((Y(\mathbf{s}_i) \mid X(\mathbf{s}_i), z(\mathbf{s}_i)) \sim \mathcal{N}(\alpha + \beta^\top X(\mathbf{s}_i) + z(\mathbf{s}_i), \tau^2)\).

  • \(\mathbf{z} \sim \mathrm{NNGP}(\mathbf{0}, K(\cdot, \cdot))\), a nearest-neighbor Gaussian process, where

    • \(K(\mathbf{s}_i, \mathbf{s}_j; \theta) = v(\mathbf{s}_i) v(\mathbf{s}_j) \exp \{ - bh(\mathbf{s}_i, \mathbf{s}_j) / \phi \}\)
    • \(v(\mathbf{s}) = \mathbb{1}(\lvert \mathbf{s} \rvert > 0) \sigma_a + \mathbb{1}(\lvert \mathbf{s} \rvert = 0) \sigma_p\)
    • Priors: \(1 / \phi \sim \mathrm{Exp}(\lambda_\phi)\), \(\log (\sigma_k) \sim N(0, 1)\) for \(k \in \{a, p\}\), \(\alpha \sim N(0, 1)\), \(\beta \sim N(0, 1)\)
  • Inference using Stan
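The covariance structure of the latent field can be sketched as below. This builds the dense matrix \(K(\mathbf{s}_i, \mathbf{s}_j; \theta)\) only; the nearest-neighbor approximation and the Stan implementation are not reproduced. The distance values, parameter values, and the name `model_cov` are hypothetical.

```python
import numpy as np

def model_cov(bh, is_areal, sigma_a, sigma_p, phi):
    """K_ij = v_i * v_j * exp(-bh_ij / phi), with v set by the unit type."""
    bh = np.asarray(bh, dtype=float)
    # v(s) equals sigma_a for areal units (positive area) and sigma_p for points.
    v = np.where(np.asarray(is_areal, dtype=bool), sigma_a, sigma_p)
    return np.outer(v, v) * np.exp(-bh / phi)

# Two areal units followed by one point-referenced unit (hypothetical distances).
bh = np.array([
    [0.0, 1.5, 3.0],
    [1.5, 0.0, 2.0],
    [3.0, 2.0, 0.0],
])
K = model_cov(bh, is_areal=[True, True, False], sigma_a=1.5, sigma_p=0.5, phi=2.0)
```

The diagonal of `K` holds the marginal variances, which differ between areal and point-referenced units.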

Physical vs Statistical Model

Change of Support

Conclusion

Summary

  • Contribution:
    • A new distance for (bounded) sets;
    • Computationally efficient (and feasible) algorithm for computing the proposed distance;
    • Proved that standard isotropic covariance families (e.g., Matérn, powered exponential) remain valid in this generalized setting;
    • Valid set-indexed isotropic RF have the potential to simplify many statistical problems in spatial statistics.

Future Work & Limitations

  • The proposed distance is a pseudometric (\(bh(A, B) = 0\) does not imply \(A \equiv B\)). However, introducing a nugget effect alleviates that problem.

  • Defining cross-covariance functions in this context would be a huge deal for spatial misalignment!

  • Other topics in this context:

    • How to define stationarity (when not assuming isotropy)?
    • What about anisotropy?
    • Mean-square differentiability may require more general concepts of differentiation.

References

Adler, R. J., and Taylor, J. E. (2007), Random fields and geometry, New York, NY: Springer. https://doi.org/10.1007/978-0-387-48116-6.
Bridson, M. R., and Haefliger, A. (1999), Metric spaces of non-positive curvature, Berlin, Heidelberg: Springer. https://doi.org/10.1007/978-3-662-12494-9.
Burago, D., Burago, Y., and Ivanov, S. (2001), A course in metric geometry, Graduate studies in mathematics, Providence, RI: American Mathematical Society. https://doi.org/10.1090/gsm/033.
Cressie, N. (1993), Statistics for spatial data, Wiley series in probability and statistics, Wiley. https://doi.org/10.1002/9781119115151.
Garkavi, A. L. (1970), “The theory of best approximation in normed linear spaces,” in Mathematical analysis, ed. R. V. Gamkrelidze, Boston, MA: Springer US, pp. 83–150. https://doi.org/10.1007/978-1-4684-3303-6_2.
Gelfand, A. E., Zhu, L., and Carlin, B. P. (2001), “On the change of support problem for spatio-temporal data,” Biostatistics, Oxford University Press, 2, 31–45.
Godoy, L. da C., Prates, M. O., and Yan, J. (2026a), “Voronoi linkage between mismatching voting stations and census tracts in analyzing the 2018 Brazilian presidential election data,” Spatial Statistics, 71, 100949. https://doi.org/10.1016/j.spasta.2025.100949.
Godoy, L. da C., Prates, M. O., and Yan, J. (2026b), “Statistical inferences and predictions for areal data and spatial data fusion with Hausdorff–Gaussian processes,” Journal of Agricultural, Biological and Environmental Statistics. https://doi.org/10.1007/s13253-025-00720-7.
Gonçalves, F. B., and Gamerman, D. (2018), “Exact Bayesian inference in spatiotemporal Cox processes driven by multivariate Gaussian processes,” Journal of the Royal Statistical Society: Series B (Statistical Methodology), Wiley Online Library, 80, 157–175.
Knauer, C., Löffler, M., Scherfenberg, M., and Wolle, T. (2011), “The directed Hausdorff distance between imprecise point sets,” Theoretical Computer Science, Elsevier, 412, 4173–4186.
Moraga, P., Cramb, S. M., Mengersen, K. L., and Pagano, M. (2017), “A geostatistical model for combined analysis of point-level and area-level data using INLA and SPDE,” Spatial Statistics, Elsevier, 21, 27–41.
Wells, J. H., and Williams, L. R. (1975), Embeddings and extensions in analysis, Berlin, Heidelberg: Springer. https://doi.org/10.1007/978-3-642-66037-5.
Zheng, X., Cressie, N., Clarke, D. A., McGeoch, M. A., and Zammit-Mangion, A. (2025), “Spatial-statistical downscaling with uncertainty quantification in biodiversity modelling,” Methods in Ecology and Evolution, Wiley Online Library, 16, 837–853. https://doi.org/10.1111/2041-210X.14505.