Often, in an applied setting, it is interesting to compute estimates of socioeconomic and demographic variables (and to quantify their associated uncertainty) for different administrative regions within the same city. Computing the estimates is trivial with the Areal Interpolation approach (Goodchild and Lam 1980). However, a common criticism of this method is that it does not provide uncertainty measures associated with the estimates (Bradley, Wikle, and Holan 2016). Our package provides simple areal interpolation along with an approximation to, and an upper bound for, the variance of the computed estimates when the variance associated with the observed data is known, which is common when working with survey data. To illustrate how it works, we are going to show how to estimate a variable observed at census tracts in the community districts of NYC.
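Concretely, for an intensive variable (such as median income) observed at source regions \(A_1, \ldots, A_n\), the areal-interpolation estimate at a target region \(B_t\) is an area-weighted average of the source values (the notation below is ours, for intuition only):

$$
\hat{Y}_t = \sum_{i = 1}^{n} w_{it} \, Y_i, \qquad
w_{it} = \frac{\lvert A_i \cap B_t \rvert}{\lvert B_t \rvert},
$$

where \(\lvert \cdot \rvert\) denotes area.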
To load the data and the packages needed for this tutorial, run the chunk of code below.
library(sf)
#> Linking to GEOS 3.10.2, GDAL 3.4.1, PROJ 8.2.1; sf_use_s2() is TRUE
library(smile)
library(ggplot2)
data(nyc_surv) ## survey estimates observed at NYC census tracts
data(nyc_comd) ## NYC community districts
Since the nyc_surv data provides a “margin of error” instead of a variance, we are going to use the piece of code below to approximate the variance. We are assuming the margin of error was calculated using a 95% confidence interval based on the Normal distribution.
nyc_surv <- transform(nyc_surv,
                      my_var = moe / qnorm(p = .975)) ## MOE to standard error
nyc_surv <- transform(nyc_surv, my_var = my_var * my_var) ## square to get the variance
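As a quick sanity check, reversing the transformation should recover the published margin of error (assuming, as usual, that moe is non-negative):

## should return TRUE: qnorm(.975) * sqrt(my_var) reproduces moe
all.equal(qnorm(p = .975) * sqrt(nyc_surv$my_var), nyc_surv$moe)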
Now, to estimate a variable observed in the nyc_surv dataset at the nyc_comd regions, we can run the following command:
estimate_comd <-
  ai(source = nyc_surv, target = nyc_comd,
     vars = "estimate")
where source is the “source” dataset, i.e., the one in which we observe the variable we wish to estimate at the target dataset. The argument vars is a character (scalar or vector) containing the name of the variable (or variables) in the source that are going to be estimated at the target regions.
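For intuition, the chunk below reproduces the intensive-variable estimate “by hand” using plain sf operations. This is only a sketch: tgt_id and est_by_hand are helper names we introduce here, and the package internals need not match this exactly.

## area-weighted average of the source values over each target region
tgt <- nyc_comd
tgt$tgt_id <- seq_len(nrow(tgt))
ints <- st_intersection(nyc_surv["estimate"], tgt["tgt_id"])
ints$w <- as.numeric(st_area(ints)) ## intersection areas used as weights
est_by_hand <-
  tapply(ints$w * ints$estimate, ints$tgt_id, sum) /
  tapply(ints$w, ints$tgt_id, sum)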
Note that ai does not quantify the uncertainty associated with the estimates. In order to get a variability measure associated with the estimates, we need to use the function ai_var (see the chunk of code below). The function is quite similar to the aforementioned ai. The differences are: 1) vars must be a scalar; 2) the argument vars_var is another length 1 character indicating the name of the variable that contains the variance of the observed variable in the source dataset; and 3) the argument var_method indicates whether we are going to use Moran’s I (var_method = "MI") to approximate the autocorrelation between areas and, therefore, approximate the variance of the estimates, or the Cauchy–Schwarz inequality (var_method = "CS") to calculate an upper bound for such variances. The output of the ai_var function is the target (an sf data.frame) with two additional columns, est and se_est, representing the estimated variable and its corresponding (approximate) standard error.
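To see where the upper bound comes from, write the estimate as the weighted sum above; the variances \(\operatorname{Var}(Y_i)\) are known but the covariances are not, and the Cauchy–Schwarz inequality, \(\lvert \operatorname{Cov}(Y_i, Y_j) \rvert \le \sqrt{\operatorname{Var}(Y_i)\operatorname{Var}(Y_j)}\), applied term by term gives (again in our notation):

$$
\operatorname{Var}(\hat{Y}_t)
= \sum_{i} \sum_{j} w_{it} \, w_{jt} \operatorname{Cov}(Y_i, Y_j)
\le \left( \sum_{i} w_{it} \sqrt{\operatorname{Var}(Y_i)} \right)^{2}.
$$

The "MI" option, instead of bounding the unknown covariances, approximates them through the spatial autocorrelation between areas.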
estimate_comd <-
  ai_var(source = nyc_surv, target = nyc_comd,
         vars = "estimate", vars_var = "my_var",
         var_method = "MI")
The next piece of code uses ggplot2 to plot the observed and estimated variable at NYC census tracts and community districts, respectively.
viz_dt <-
  rbind(
    transform(nyc_surv, source = "Observed", est = estimate)[c("source", "est")],
    transform(estimate_comd, source = "Predicted")[c("source", "est")]
  )

ggplot(data = viz_dt,
       aes(fill = est / 1000)) +
  geom_sf(color = 1, lwd = .1) +
  scale_fill_viridis_c(option = "H",
                       name = "median\nincome",
                       labels = \(x) paste0(x, "K")) +
  theme_bw() +
  facet_wrap(~ source) +
  theme(axis.text = element_blank(),
        axis.ticks = element_blank())
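Since ai_var also returns a standard error for every community district, it can be useful to map the relative uncertainty of the estimates as well. The snippet below is our addition to the workflow, reusing the objects created above; it displays the coefficient of variation, se_est / est.

## map the relative uncertainty of the interpolated estimates
estimate_comd <- transform(estimate_comd, cv = se_est / est)

ggplot(data = estimate_comd, aes(fill = cv)) +
  geom_sf(color = 1, lwd = .1) +
  scale_fill_viridis_c(name = "CV") +
  theme_bw() +
  theme(axis.text = element_blank(),
        axis.ticks = element_blank())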