Why you should default to Gaussian processes for random spatial and temporal effects

Published: October 9, 2024

Introduction

It is common to use random effects to capture heterogeneity in data that have been collected at spatial and/or temporal intervals. For example, counts of species or individuals are frequently replicated in space and time, and researchers may add site or survey date as random effects in the model. These random effects are typically modeled as normally distributed. Consider surveys \(i \in 1:I\) with counts \(y_i\), which we model with Poisson regression. The model with normally distributed random survey effects on the log expected count looks like this:

\[ \begin{aligned} y_i &\sim \textrm{Poisson} \left( \exp(\alpha + \epsilon_i) \right) \\ \epsilon_i &\sim \textrm{Normal} \left( 0, \tau \right) \end{aligned} \tag{1}\]

where the intercept \(\alpha\), the random survey-level offsets \(\epsilon_i\), and the standard deviation of the survey effects \(\tau\) are to be estimated by the model. In this post, I argue that when the random effects are not expected to be independent it almost always makes more sense to model the random effects using Gaussian processes (GPs).
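To make the notation concrete, here's a minimal R sketch simulating counts from Equation 1 (the parameter values are arbitrary and chosen purely for illustration):

Code
# simulate counts from Equation 1 with independent normal survey effects
set.seed(1)
I <- 10                      # number of surveys
alpha <- 4                   # intercept on the log scale
tau <- 0.5                   # standard deviation of the survey effects
epsilon <- rnorm(I, 0, tau)  # random survey-level offsets
y <- rpois(I, exp(alpha + epsilon))  # Poisson counts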

GPs are a bit complicated (Simpson 2021) but one of their salient features is that any set of observations generated by a Gaussian process is marginally distributed as multivariate normal \(\textrm{Normal} (\boldsymbol{\mu}, \Sigma)\).1 In practice, the realisations of the GP are added to our linear predictor, so \(\boldsymbol{\mu} = \boldsymbol{0}\). There are quite a few covariance kernels \(\Sigma\) to choose from, but one of the most common is the exponentiated quadratic kernel given by \[ \Sigma_{ij} = \tau^2 \exp \left( -\frac{\lVert \boldsymbol{x}_i - \boldsymbol{x}_j \rVert^2}{2 \rho ^2} \right), \tag{2}\]

where \(\tau^2\) is the marginal variance (analogous to the variance \(\tau^2\) of the normally distributed random effects in Equation 1), \(\rho\) is the length-scale governing the covariance between observations, and \(\lVert \boldsymbol{x}_i - \boldsymbol{x}_j \rVert\) is the Euclidean distance between any two points of our input.2 The length-scale \(\rho\) governs the correlation between points as a function of the distance between them: when \(\rho\) is small, there is little correlation between neighbouring locations, and when \(\rho\) is large, there is high correlation between observations. Essentially, GPs allow modeling normally distributed random effects where observations close to each other (as in, surveys conducted at the same time of year) may be correlated.
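To build some intuition for \(\rho\), here's a minimal R sketch drawing one realisation from a zero-mean GP prior with the exponentiated quadratic kernel under a short and a long length-scale (inputs and parameter values are arbitrary):

Code
# exponentiated quadratic kernel over a vector of inputs
exp_quad <- function(x, tau, rho) {
  tau^2 * exp(-as.matrix(dist(x))^2 / (2 * rho^2))
}

# one GP realisation each for a short and a long length-scale;
# the small diagonal jitter keeps the Cholesky decomposition stable
x <- seq(0, 10, 0.25)
set.seed(1)
z <- rnorm(length(x))
f_wiggly <- t(chol(exp_quad(x, tau = 1, rho = 0.5) + diag(1e-9, length(x)))) %*% z
f_smooth <- t(chol(exp_quad(x, tau = 1, rho = 5) + diag(1e-9, length(x)))) %*% z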

Example: tadpole counts

For this example, I’ll use data we collected on tadpoles of Fleay’s barred frogs (Mixophyes fleayi), a stream-dwelling species endemic to the Gondwana rainforests of northern New South Wales and southeast Queensland, Australia (Hollanders et al. 2024). During my PhD, I swept a dipnet through pools in the creeks for roughly 1 min every six weeks, deposited the tadpoles in a tub, took a photograph from above, and used Photoshop to count and measure the tadpoles.3 Here, I’ll fit the model in Equation 1 with and without a GP to model the number of tadpoles I captured during each six-weekly survey at one of the sites. First I’ll download the data from my GitHub and plot the counts for the creek using ggplot2 (Wickham 2016) and other packages from the tidyverse (Wickham et al. 2019).

Code
# download data
library(tidyverse)
green <- "#2e7d32"  # plot colour used throughout (assumed; defined in the post's setup)
dat <- read_csv("https://raw.githubusercontent.com/mhollanders/mfleayi-tadpoles/main/data/tadpole-lengths.csv") |> 
  filter(site == "brindle") |> 
  mutate(date = dmy(date)) |> 
  count(date)

# plot the counts
fig <- dat |> 
  ggplot(aes(date, n)) + 
  geom_point(colour = green, shape = 16, size = 3, alpha = 3/4) + 
  scale_y_continuous(breaks = seq(50, 200, 50), limits = c(0, 225), expand = c(0, 0)) + 
  labs(x = "Survey", y = "Number of tadpoles")
fig

Fleay’s barred frogs breed from the early spring to the middle of summer and generally there are several pulses of reproduction resulting in fresh cohorts of tadpoles. We’ll ignore this and just model the counts using random survey effects in the absence of any predictors.

Fleay’s barred frog (Mixophyes fleayi) from Brindle Creek, Border Ranges National Park, Australia.

Setting priors for GPs

GPs have two parameters that require priors, \(\tau\) and \(\rho\). The prior for \(\tau\) can just be something like an exponential prior commonly used for standard deviations of random effects.4 The prior for the length-scale \(\rho\) is trickier as it is often not well identified by the data. In fact, the data cannot inform length-scales that are outside of the distances observed in the data. This means that we want a prior that can suppress values below and above certain thresholds. One popular distribution for length-scales that’s capable of doing exactly this is the inverse gamma distribution. Michael Betancourt’s GP case study discusses this at length and includes a function that uses Stan’s algebra solver to determine the shape and scale of the inverse gamma distribution that places a certain proportion of the probability mass between two values.5

If the study was well designed to estimate the length-scale of the GP, the survey intervals should be shorter than the minimum length-scale we think we might be dealing with. Unfortunately, I did not collect data with this question in mind, and I can easily see that the relevant length-scale is shorter than the minimum interval between surveys (I surveyed at six-weekly intervals, give or take). However, for this demonstration, we’ll assume that the survey intervals were rigorously chosen and we’ll use the minimum and maximum observed temporal distances between surveys to choose a suitable inverse gamma prior for the length-scale. So, I compute the minimum and maximum distance and choose tail probabilities of 1% so that 98% of the probability mass of the length-scale falls between the minimum and maximum observed distances between the surveys.
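As a quick sanity check on that tail condition, here's a base-R sketch using the fact that if \(X\) is inverse gamma with shape \(a\) and scale \(b\), then \(1/X\) is gamma with shape \(a\) and rate \(b\); the shape and scale values plugged in are the ones the solver reports further below:

Code
# inverse gamma CDF via the gamma distribution of 1/X
pinvgamma <- function(q, shape, scale) {
  pgamma(1 / q, shape = shape, rate = scale, lower.tail = FALSE)
}

# pairwise survey distances in weeks and the prior's mass at the bounds
d <- as.matrix(dist(as.numeric(dat$date) / 7))
bounds <- range(d[d > 0])
pinvgamma(bounds[1], 3.58, 38.83)  # should be ~0.01
pinvgamma(bounds[2], 3.58, 38.83)  # should be ~0.99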

Fitting the model

Below I write out a Stan program (Carpenter et al. 2017) that uses an indicator to switch between normally distributed random effects and a GP with an exponentiated quadratic covariance kernel. Stan includes the handy function gp_exp_quad_cov() that generates this covariance kernel from the input (survey times expressed in weeks, in this case), the marginal standard deviation, and the length-scale. We also pre-compute the distance matrix of this GP input to get the minimum and maximum observed (temporal) distances between surveys, which Stan’s algebra solver uses to determine a suitable inverse gamma prior for \(\rho\). For both types of random effects I use the non-centered parameterisation, which generally samples better with few observations (we only have 10 data points). For GPs (and multivariate normals more generally), this entails Cholesky decomposing the covariance matrix (think of it as a matrix square root) and multiplying it by a vector of standard normal variates. Note that before Cholesky decomposing the covariance kernel, I add a diagonal matrix with a very small value (called the “nugget” or “jitter”) to ensure the matrix is positive definite.

Code
functions {
  // function from Michael Betancourt for inverse gamma prior shape and scale
  vector inv_gamma_prior(vector guess, vector theta, array[] real tails, array[] int x_i) {
    vector[2] params;
    params[1] = inv_gamma_cdf(theta[1] | guess[1], guess[2]) - tails[1];
    params[2] = inv_gamma_cdf(theta[2] | guess[1], guess[2]) - tails[2];
    return params;
  }
}

data {
  int<lower=1> I;  // number of surveys
  array[I] real survey;  // survey input as number of weeks
  vector[2] min_max_dist;  // minimum and maximum observed distance between weeks
  array[I] int y;  // tadpole counts per survey
  int<lower=0, upper=1> GP;  // indicator for GP (0 = no, 1 = yes)
}

transformed data {
  vector<lower=0>[2] rho_prior = ones_vector(2);  // length-scale prior shape and scale
  if (GP) {
    rho_prior = algebra_solver(inv_gamma_prior, [1, 1]', min_max_dist, {0.01, 0.99}, {0});
    print("rho_prior: ", rho_prior);
  }
}

parameters {
  real alpha;  // intercept
  real<lower=0> tau, rho;  // random effect SD and GP length-scale
  vector[I] z;  // standard-normal z-scores for non-centered parameterisation
}

transformed parameters {
  vector[I] epsilon;  // random survey effects
  if (GP) {
    epsilon = cholesky_decompose(add_diag(gp_exp_quad_cov(survey, tau, rho), 1e-9)) * z;
  } else {
    epsilon = tau * z;
  }
}

model {
  // priors
  alpha ~ normal(0, 5);
  tau ~ exponential(1);
  rho ~ inv_gamma(rho_prior[1], rho_prior[2]);
  z ~ std_normal();
    
  // likelihood
  y ~ poisson_log(alpha + epsilon);
}

generated quantities {
  vector[I] yrep, log_lik;
  for (i in 1:I) {
    yrep[i] = poisson_log_rng(alpha + epsilon[i]);
    log_lik[i] = poisson_log_lpmf(y[i] | alpha + epsilon[i]);
  }
}

I’ll fit both the normal and GP model with CmdStanR (Gabry et al. 2024) in R 4.4.1 (R Core Team 2024).

Code
# compile the Stan program (assuming it's saved as "tadpoles.stan")
library(cmdstanr)
mod <- cmdstan_model("tadpoles.stan")

# prepare data for Stan
stan_data <- list(I = nrow(dat), 
                  survey = as.numeric(dat$date) / 7, 
                  y = dat$n, 
                  GP = 0)

# get distance matrix and add minimum and maximum observed distances
dist_obs <- dist(stan_data$survey) |> as.matrix()
stan_data$min_max_dist <- range(dist_obs[dist_obs > 0])

# normal model
fit_normal <- mod$sample(data = stan_data, 
                         refresh = 0, chains = 1, 
                         iter_warmup = 500, iter_sampling = 4000, 
                         show_exceptions = F)
Running MCMC with 1 chain...

Chain 1 finished in 0.3 seconds.
Code
# GP model
stan_data$GP <- 1
fit_gp <- mod$sample(data = stan_data, 
                     refresh = 0, chains = 1, 
                     iter_warmup = 500, iter_sampling = 4000, 
                     show_exceptions = F)
Running MCMC with 1 chain...

Chain 1 rho_prior: [3.58235,38.8276] 
Chain 1 finished in 1.4 seconds.

I’ll first plot the inverse gamma prior for the length-scale that was generated by the solver (green), alongside the posterior distribution of \(\rho\) (black) and the minimum and maximum observed temporal distances of the surveys (dashed lines), using the distributional (O’Hara-Wild et al. 2024) and ggdist and tidybayes packages (Kay 2024a, 2024b). The length-scale seems to want to be smaller than the prior allows, which is a consequence of me enforcing an informative prior. Recall that I pretended the survey intervals were chosen in a principled manner to estimate this parameter, by specifying a prior implying the length-scale really shouldn’t be much shorter than six weeks.

Code
# packages
library(distributional)
library(ggdist)
library(tidybayes)

# plot
tibble(prior = dist_inverse_gamma(3.58, scale = 38.83)) |> 
  ggplot(aes(xdist = prior)) + 
  geom_vline(xintercept = stan_data$min_max_dist, 
             linetype = "dashed", linewidth = 1/2, colour = "#333333") +
  stat_slab(fill = green, alpha = 2/3) + 
  stat_slab(aes(xdist = rho), 
            data = spread_rvars(fit_gp, rho), 
            fill = "#333333", alpha = 2/3) + 
  scale_x_continuous(breaks = seq(0, 60, 20), limits = c(0, 70), expand = c(0, 0)) + 
  scale_y_continuous(breaks = NULL, expand = c(0, 0)) + 
  theme(panel.border = element_rect(colour = NA), 
        axis.line.x = element_line(colour = "#333333")) + 
  labs(x = expression(rho), y = NULL)

Focusing now on the posterior draws for the random survey effects \(\epsilon_i\), it’s clear that the predictions from the normal and GP models are very similar.

Code
# add predictions to figure
library(tidybayes)
fig + 
  stat_pointinterval(aes(ydist = exp(alpha + epsilon), 
                         shape = factor(model) |> fct_rev()), 
                     data = fit_normal |> 
                       spread_rvars(alpha, epsilon[i]) |> 
                       mutate(model = "Normal") |> 
                       bind_cols(dat) |> 
                       bind_rows(
                         fit_gp |> 
                           spread_rvars(alpha, epsilon[i]) |> 
                           mutate(model = "GP") |> 
                           bind_cols(dat)
                       ), 
                     position = position_dodge(width = 40),
                     point_interval = "median_hdci", .width = 0.95, 
                     size = 1, linewidth = 1/2) + 
  scale_colour_manual(values = c("#333333", green)) + 
  labs(shape = "Model")

We can also see that the estimated standard deviation \(\tau\) of the normal model is very similar to the marginal standard deviation of the GP model, albeit with more uncertainty. To be honest, the difference in the estimates of \(\tau\) here may be driven more by our prior on \(\rho\), which may be a bit more informative than it should be.

Code
fit_normal |> 
  spread_rvars(tau) |> 
  mutate(model = "Normal") |> 
  bind_rows(fit_gp |> 
              spread_rvars(tau) |> 
              mutate(model = "GP")) |> 
  ggplot(aes(factor(model) |> fct_rev(), ydist = tau)) +
  stat_pointinterval(point_interval = "median_hdci", .width = 0.95, 
                     size = 1/2, linewidth = 1/2) + 
  scale_colour_manual(values = c("#333333", "red4")) + 
  labs(x = "Model", y = expression(paste("Posterior distribution of ", tau)))

Plotting predictions

It can be hard to interpret the parameters of the GP directly, so I will make some predictions over unobserved surveys at weekly intervals. Making GP predictions involves linear algebra that I don’t pretend to fully understand. Below is some code I modified from Nicholas Clark’s excellent blog post on spatial GPs.
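For reference, the predictions rely on the standard conditional multivariate normal result. Writing \(K\) for the kernel among observed inputs, \(K_{**}\) for the kernel among prediction inputs, and \(K_*\) for the cross-covariance between them, the latent GP values at the prediction points are distributed as

\[ \boldsymbol{f}_* \mid \boldsymbol{\epsilon} \sim \textrm{Normal} \left( K_*^\top K^{-1} \boldsymbol{\epsilon}, \; K_{**} - K_*^\top K^{-1} K_* \right), \]

and the predict_gp() function below draws from this distribution by adding a Cholesky factor of the conditional covariance, multiplied by standard normal variates, to the conditional mean.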

Code
# predicted dates and distance matrix
pred_dates <- seq.Date(min(dat$date) - weeks(2), max(dat$date) + weeks(2), 7)
pred <- as.numeric(pred_dates) / 7
dist_pred <- dist(pred) |> as.matrix()

# distance matrix between observed and predicted dates
dist_obs_to_pred <- proxy::dist(stan_data$survey, pred)

# squared exponential kernel function
gp_exp_quad_cov <- function(dist, tau, rho) {
  tau^2 * exp(-0.5 * (dist / rho)^2)
}

# function from Nicholas Clark
predict_gp <- function(tau, rho, epsilon, dist_obs, dist_pred, dist_obs_to_pred, nugget = 1e-9) {
  
  # estimated covariance kernel for the fitted points (nugget on the diagonal for numerical stability)
  K_obs <- gp_exp_quad_cov(dist_obs, tau, rho) + diag(nugget, dim(dist_obs)[1])
  
  # estimated covariance kernel for prediction points
  K_pred <- gp_exp_quad_cov(dist_pred, tau, rho)
  
  # cross-covariance between fitted and prediction points
  K_new <- gp_exp_quad_cov(dist_obs_to_pred, tau, rho)
  
  # number of prediction points
  n_pred <- dim(dist_pred)[1]
  
  # posterior draw for prediction points
  return(t(K_new) %*% solve(K_obs, epsilon) + 
           t(chol(K_pred - t(K_new) %*% solve(K_obs, K_new) + diag(nugget, n_pred))) %*% rnorm(n_pred))
}

Next I extract the posterior draws of the model parameters and use the predict_gp() function on each draw to capture uncertainty. Then I plot the predictions over our existing figure. The GP captures the higher tadpole counts in the warmer months when reproduction is occurring, and the low tadpole numbers in the winter and spring months prior to the first pulse of reproduction (July–October 2020). It’s possible that a different covariance kernel would work better, such as a periodic covariance kernel (sketched after the figure below), but I won’t go into that here.

Code
# posterior draws
alpha <- spread_draws(fit_gp, alpha)$alpha
tau <- spread_draws(fit_gp, tau)$tau
rho <- spread_draws(fit_gp, rho)$rho
epsilon <- spread_draws(fit_gp, epsilon[i])
n_draws <- length(alpha)

# make predictions
pred <- map(.x = 1:n_draws, 
            .f = ~(tibble(
              .draw = .x, 
              date = pred_dates, 
              alpha = alpha[.x], 
              gp_f = predict_gp(
                tau = tau[.x], 
                rho = rho[.x], 
                epsilon = filter(epsilon, .draw == .x) |> pull(epsilon), 
                dist_obs = dist_obs, 
                dist_pred = dist_pred, 
                dist_obs_to_pred = dist_obs_to_pred
              )[, 1]
            ))) |> 
  list_rbind()

# add to figure
fig + 
  stat_lineribbon(aes(y = exp(alpha + gp_f)), 
                  data = pred, 
                  point_interval = "median_qi", 
                  .width = 0.9, 
                  colour = green, 
                  fill = green, 
                  alpha = 1/3) + 
  scale_x_date(expand = c(0, 0))
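For the curious, here's what a periodic kernel could look like — a sketch only, not something fitted in this post. The standard periodic kernel repeats its correlation structure with period \(p\), which could in principle capture the seasonal pulses of reproduction:

Code
# standard periodic covariance kernel (a sketch; not fitted in this post)
gp_periodic_cov <- function(dist, tau, rho, p) {
  tau^2 * exp(-2 * sin(pi * dist / p)^2 / rho^2)
}

# e.g., a covariance kernel with an annual (52-week) period for the observed surveys
K_seasonal <- gp_periodic_cov(dist_obs, tau = 1, rho = 1, p = 52)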

Conclusion

Using GPs to model random effects makes a lot of sense if there is some interest in modeling correlation between temporally or spatially organised data. Given that switching out the normally distributed random effects for a GP is trivial in modern probabilistic programming languages such as Stan, it makes sense to default to using them, especially as it involves just one extra parameter, the length-scale \(\rho\).

References

Carpenter, B., A. Gelman, M. D. Hoffman, D. Lee, B. Goodrich, M. Betancourt, M. Brubaker, J. Guo, P. Li, and A. Riddell. 2017. Stan: A Probabilistic Programming Language. Journal of Statistical Software 76:1–32.
Gabry, J., R. Češnovar, A. Johnson, and S. Bronder. 2024. cmdstanr: R interface to CmdStan. R package.
Hollanders, M., L. F. Grogan, H. I. McCallum, and D. A. Newell. 2024. High chytrid prevalence and infection intensities in tadpoles of Mixophyes fleayi. Wildlife Research 51.
Kay, M. 2024a. tidybayes: Tidy data and geoms for Bayesian models. R package.
Kay, M. 2024b. ggdist: Visualizations of distributions and uncertainty. R package.
O’Hara-Wild, M., M. Kay, and A. Hayes. 2024. distributional: Vectorised probability distributions. R package.
R Core Team. 2024. R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria.
Simpson, D. 2021, November 3. Yes but what is a Gaussian process? Or, Once, twice, three times a definition; or A descent into madness. https://dansblog.netlify.app/yes-but-what-is-a-gaussian-process-or-once-twice-three-times-a-definition-or-a-descent-into-madness.
Wickham, H. 2016. ggplot2: Elegant graphics for data analysis. Springer-Verlag New York.
Wickham, H., M. Averick, J. Bryan, W. Chang, L. D. McGowan, R. François, G. Grolemund, A. Hayes, L. Henry, J. Hester, M. Kuhn, T. L. Pedersen, E. Miller, S. M. Bache, K. Müller, J. Ooms, D. Robinson, D. P. Seidel, V. Spinu, K. Takahashi, D. Vaughan, C. Wilke, K. Woo, and H. Yutani. 2019. Welcome to the tidyverse. Journal of Open Source Software 4:1686.

Footnotes

  1. Dan Simpson’s blog is a great place to learn more about GPs.

  2. When our input is multidimensional, such as coordinates for spatial data, we are still working with the Euclidean distance between vectors.

  3. I also did other things with those tadpoles, but I won’t go into that here.

  4. This also happens to be the penalised complexity (PC) prior for this parameter.

  5. The case study is a great place to learn more about GPs more generally.