Efficient Jolly-Seber modeling in Stan

Published

March 6, 2025

Introduction

Individual-based mark-recapture methods, where individual animals are identified through unique markings or through implants like passive integrated transponder (PIT) tags, form the backbone of many demographic studies. The canonical mark-recapture model, the Cormack-Jolly-Seber (CJS) model, estimates apparent[1] individual survival probabilities while accounting for imperfect detection. Population sizes are often of interest and can be derived from mark-recapture models by summing the number of individuals estimated to be alive during a given survey. However, since CJS models condition on an individual’s first capture, the population sizes derived from these models pertain only to the subset of the population that was captured at least once during the study. To obtain unbiased population sizes, we also need to account for individuals that were alive in the study area but never captured—this is where Jolly-Seber models come in.

(Cormack-)Jolly-Seber models

I have already blogged about CJS models, but I will briefly explain the assumed data-generating process here. Consider individuals \(i \in 1:I\) that were ever captured during surveys \(j \in 1:J\). After an individual’s first capture at survey \(f_i\), i.e. during surveys \((f_i + 1):J\), the multistate formulation of the survival process is described by the following transition probability matrix (TPM), where the rows represent the state at survey \(j-1\) (state 1 is alive and state 2 is dead), the columns the state at survey \(j\), and \(\tau_j\) is the length of the interval ending at survey \(j\):

\[ {P_z}_{[j]} = \begin{bmatrix} \phi^{\tau_j} & 1 - \phi^{\tau_j} \\ 0 & 1 \end{bmatrix} \tag{1}\]

Thus, individuals survive a survey interval with probability \(\phi^{\tau_j}\) and cannot come back from the dead. Conditional on an individual being alive during a survey, it is recaptured with detection probability \(p\)[2], yielding an \(I \cdot J\) detection history \(\boldsymbol{Y}\) indicating whether individuals were detected during surveys or not (\(y_{ij} = 1\) indicates non-detection, \(y_{ij} = 2\) indicates detection). Survival and detection probabilities can be modeled flexibly at the level of the individual and survey. When we use the forward algorithm to compute the individual-level log likelihood in Stan (Carpenter et al. 2017), we can use the backward sampling algorithm to recover the \(I \cdot J\) matrix \(\boldsymbol{Z}\) of latent alive states and compute derived quantities, such as the number of alive individuals in each survey. As mentioned above, however, the usefulness of these quantities is limited because they provide no information about individuals that were never captured.
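To make the forward algorithm concrete, here is a minimal Python sketch (an illustration, not the Stan implementation) that computes the likelihood of a single post-first-capture history, conditioning on the individual being alive at first capture, and checks it against brute-force enumeration of all alive/dead paths. It uses 1/0 for detected/not detected rather than the 2/1 coding in the text, equal survey intervals, and arbitrary parameter values:

```python
import itertools
import numpy as np

def cjs_lik_forward(y, phi, p):
    """Likelihood of a capture history y (1 = detected, 0 = not detected)
    after first capture, via the forward algorithm.
    State 0 = alive, state 1 = dead; y[0] is the first capture."""
    P_z = np.array([[phi, 1.0 - phi],   # alive -> alive / dead
                    [0.0, 1.0]])        # dead stays dead
    def emit(yj):  # P(y_j | state): dead individuals are never detected
        return np.array([p if yj else 1.0 - p, 0.0 if yj else 1.0])
    omega = np.array([1.0, 0.0])        # condition on alive at first capture
    for yj in y[1:]:
        omega = (P_z.T @ omega) * emit(yj)
    return omega.sum()

def cjs_lik_brute(y, phi, p):
    """The same likelihood by enumerating every latent state path."""
    P_z = np.array([[phi, 1.0 - phi], [0.0, 1.0]])
    total = 0.0
    for path in itertools.product([0, 1], repeat=len(y) - 1):
        pr, prev = 1.0, 0
        for yj, s in zip(y[1:], path):
            e = (p if yj else 1.0 - p) if s == 0 else (0.0 if yj else 1.0)
            pr *= P_z[prev, s] * e
            prev = s
        total += pr
    return total

y = [1, 0, 1, 0, 0]  # detected at surveys 1 and 3, never seen again
assert np.isclose(cjs_lik_forward(y, 0.8, 0.6), cjs_lik_brute(y, 0.8, 0.6))
```

The forward pass costs \(O(J)\) per individual instead of the \(O(2^J)\) enumeration, which is the whole point of the algorithm.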

Jolly-Seber models provide a remedy by modeling the entry process, i.e., how individuals enter the study population before their first capture. By explicitly modeling the entry process, we can accommodate individuals that may have been alive in the study area but were never captured, and thus obtain an unbiased estimate of population size. Suppose there is a latent “superpopulation” of \(N_\text{super}\) individuals that were ever alive in the area during the study period, of which only the subset \(I\) were ever captured during the study. We assume that \(\boldsymbol{B} = B_1, \dots, B_J\) individuals “entered” the study in discrete pulses during each survey with entry probabilities \(\boldsymbol{\gamma} = \gamma_1, \dots, \gamma_J\), so that \(\sum_{j=1}^J B_j = N_\text{super}\) and \(\boldsymbol{B} \sim \textrm{Multinomial} \left( N_\text{super}, \boldsymbol{\gamma}\right)\). This implies that each individual’s entry occasion \(b_i \sim \textrm{Categorical}\left(\boldsymbol{\gamma}\right)\).
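The entry process is easy to simulate directly; a small Python sketch with arbitrary, illustrative values for \(N_\text{super}\), \(J\), and \(\boldsymbol{\gamma}\):

```python
import numpy as np

rng = np.random.default_rng(1)
N_super, J = 200, 8

# entry probabilities over the J surveys (must sum to 1)
gamma = rng.dirichlet(np.ones(J))

# number of entries per survey: B ~ Multinomial(N_super, gamma)
B = rng.multinomial(N_super, gamma)
assert B.sum() == N_super

# equivalently, per-individual entry occasions b_i ~ Categorical(gamma)
b = rng.choice(J, size=N_super, p=gamma) + 1  # 1-based survey index
```

Only individuals whose entry is followed by at least one detection ever show up in the data; the rest are exactly what the augmented all-zero histories below stand in for.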

Of course, we do not know \(N_\text{super}\), but we can estimate it using data augmentation (Royle and Dorazio 2012). Since we know that the number of observed individuals \(I \leq N_\textrm{super}\), we can augment our observed capture history \(\boldsymbol{Y}\) with a sufficiently large number \(I_\textrm{aug}\) of all-zero capture histories. Some of these augmented individuals could have been in the study area and never captured, whereas others are pseudo-individuals that never existed. By modeling \(N_\text{super} \sim \textrm{Binomial}\left(I + I_\textrm{aug}, \psi\right)\), we can estimate \(N_\text{super}\) as well as the entry probabilities and the latent survey-level population sizes. The inclusion probability \(\psi\) is purely a nuisance parameter because its posterior distribution changes with the somewhat arbitrary choice of \(I_\textrm{aug}\).[3]
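To see the role of \(\psi\), consider the marginal probability of one augmented all-zero capture history: either the row is a pseudo-individual (probability \(1 - \psi\)), or it is a real individual that was never detected. A toy numerical sketch, simplified by ignoring the entry and survival processes (i.e., pretending the individual is available for all \(J\) surveys), with arbitrary values:

```python
J, p, psi = 5, 0.4, 0.3

# a real individual available for all J surveys is never detected with prob.
pr_never_detected = (1 - p) ** J            # 0.6**5 = 0.07776

# two-component mixture for one augmented all-zero capture history
pr_all_zero = (1 - psi) + psi * pr_never_detected
print(round(pr_all_zero, 6))  # 0.723328
```

In the Stan program below, the second mixture component is the entry- and survival-marginalised never-detected probability rather than the simple \((1 - p)^J\) used here.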

Implementation in Stan

When using JAGS or NIMBLE, we would struggle to fit this model, as we would have to use a loop for (j in (b[i] + 1):J), where the b[i]s are discrete parameters, which are not allowed in indexing.[4] In Stan, where all discrete parameters have to be marginalised out anyway, handling the entry occasions this way is natural. Furthermore, in the absence of individual-level effects (e.g., covariates or random effects), all \(I_\textrm{aug}\) unobserved individuals have the same log likelihood, which can be computed once and multiplied by \(I_\textrm{aug}\). Also note that for observed individuals, we only have to marginalise the possible entry occasions up to their first capture. Below is a Stan program that implements the model, including the backward sampling algorithm required to recover the posterior distributions of the latent states, which we use to estimate population sizes. I wrote the model such that if \(I_\textrm{aug} = 0\), it defaults to a CJS model; otherwise the JS model is used. If \(I_\textrm{aug}\) is too small and the estimated \(N_\text{super} = I + I_\textrm{aug}\), a warning message is printed to alert the user to increase \(I_\textrm{aug}\).
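Written out in the notation above, the quantity being marginalised for an observed individual first captured at survey \(f_i\) (whose earlier surveys are all non-detections) is

\[ \Pr\left(y_{i,1:f_i}\right) = \sum_{k=1}^{f_i} \gamma_k \, (1 - p)^{f_i - k} \, p \prod_{j=k+1}^{f_i} \phi^{\tau_j} \]

where \(k\) indexes the possible entry occasions; occasions after \(f_i\) contribute nothing, since entry after the first detection is impossible.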

Since the first version of this model, where I simply modeled the entry probabilities as drawn from an uninformative Dirichlet distribution, I have attempted to model the entry probabilities as a function of the time interval between surveys, just like survival. The idea is to model the expected number of entries on the log scale \(\mu\), where every survey after the first has an associated time interval (there is no time interval up to the first survey). I also included survey-level covariates for the entry probabilities, which pertain to each of the \(J\) surveys. Although in Dirichlet regression one of the outcomes is typically fixed to 0 for identifiability, I am not sure this is strictly necessary here; SBC seems to suggest that we can model covariates at the level of each survey. I generate survey-level random variates gamma_z from an exponential distribution, which seems most appropriate because a Dirichlet vector can be constructed as normalised rates generated by independent \(\textrm{Gamma}(\alpha, 1)\) distributions.
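That gamma-to-Dirichlet construction is easy to check numerically: normalising independent \(\textrm{Gamma}(\alpha_k, 1)\) variates yields a \(\textrm{Dirichlet}(\boldsymbol{\alpha})\) vector. A quick Python sketch comparing the empirical means to the known Dirichlet means \(\alpha_k / \sum_k \alpha_k\), with arbitrary \(\boldsymbol{\alpha}\):

```python
import numpy as np

rng = np.random.default_rng(1)
alpha = np.array([1.0, 2.0, 0.5, 1.5])

# normalised independent Gamma(alpha_k, 1) rates are Dirichlet(alpha)
g = rng.gamma(alpha, 1.0, size=(100_000, len(alpha)))
d = g / g.sum(axis=1, keepdims=True)

# compare empirical means to the known Dirichlet means alpha / sum(alpha)
assert np.allclose(d.mean(axis=0), alpha / alpha.sum(), atol=0.01)
```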

Code
functions {
  /** 
   * Return the natural logarithm of the product of the elementwise 
   * exponentiation of a matrix and vector (log matrix multiplication)
   * 
   * @param A  Matrix
   * @param B  Vector
   * @return   log(exp(A) * exp(B))
   */
  vector log_prod_exp(matrix A, vector B) {
    int I = rows(A);
    int J = cols(A);
    matrix[J, I] A_tr = A';
    vector[I] C;
    for (i in 1:I) {
      C[i] = log_sum_exp(A_tr[:, i] + B);
    }
    return C;
  }
  
  /** 
   * Sample latent discrete parameters
   *
   * @param A  Vector of unnormalised probabilities
   * @return   Categorical outcome
   */
  int latent_rng(vector A) {
    return categorical_rng(exp(A - log_sum_exp(A)));
  }
}

data {
  int<lower=1> I, J;  // number of individuals and surveys
  array[I] int<lower=1, upper=J> f;  // first capture
  array[I] int<lower=f, upper=J> l;  // last capture
  vector<lower=0>[J - 1] tau;  // survey intervals
  int<lower=1> X;  // number of recruitment covariates
  matrix[J, X] x;  // recruitment covariates
  array[I, J] int<lower=1, upper=2> y;  // detection history
  int<lower=0> I_aug;  // number of augmented individuals
}

transformed data {
  int JS = I_aug > 0;
  vector[JS * J] gamma_prior = ones_vector(JS * J);
  vector[J - 1] tau_scl = tau / sum(tau) * J;
  vector[J - 1] log_tau_scl = log(tau_scl);
}

parameters {
  real<lower=0, upper=1> phi, p;
  vector<lower=0>[JS] gamma_a;
  vector[JS * X] gamma_b;
  vector<lower=0>[JS * (J - 1)] gamma_z;
  vector<lower=0, upper=1>[JS] psi;
}

transformed parameters {
  vector[JS * J] mu, gamma;
  if (JS) {
    mu = x * gamma_b;
    mu[2:J] += gamma_a[1] * log_tau_scl + log(gamma_z);
    gamma = softmax(mu);  // uninformative prior would be gamma ~ dirichlet(ones_vector(J))
  }
  real lprior = beta_lpdf([phi, p] | 1, 1);
  if (JS) {
    lprior += exponential_lpdf(gamma_a[1] | 1)
              + std_normal_lpdf(gamma_b)
              + exponential_lpdf(gamma_z | 1)
              + beta_lpdf(psi[1] | 1, 1);
  }
}

model {
  // priors
  target += lprior;
  
  // pre-compute log probabilities
  vector[J - 1] phi_tau = pow(phi, tau);
  vector[J - 1] log_phi = log(phi_tau), log1m_phi = log1m(phi_tau);
  real log_p = log(p), log1m_p = log1m(p), log_psi, log1m_psi;
  if (JS) {
    log_psi = log(psi[1]);
    log1m_psi = log1m(psi[1]);
    
    // increment inclusion probabilities of observed individuals
    target += log_psi * I;
  }
  
  // (log) TPMs
  array[J - 1] matrix[2, 2] P_z;
  for (j in 1:(J - 1)) {
    P_z[j] = [[ log_phi[j], log1m_phi[j] ],
              [ negative_infinity(), 0 ]];
  }
  // emission log TPM: rows = state, columns = y (1 = non-detection, 2 = detection)
  matrix[2, 2] P_y = [[ log1m_p, log_p ],
                      [ 0, negative_infinity() ]];
                      
  // marginal log probabilities for entry occasions and forward algorithm
  vector[J] lp;
  vector[2] Omega;
  
  // forward algorithm for observed individuals
  for (i in 1:I) {
    
    // initialise for CJS and marginalise out entry occasions if JS
    Omega = [ 0, negative_infinity() ]';
    if (JS) {
      for (k in 1:f[i]) {
        lp[k] = categorical_lpmf(k | gamma) + P_y[1, y[i, k]];
        for (j in (k + 1):f[i]) {
          lp[k] += P_z[j - 1][1, 1] + P_y[1, y[i, j]];
        }
      }
      Omega[1] = log_sum_exp(lp[1:f[i]]);
    }
    
    // subsequent surveys
    for (j in (f[i] + 1):l[i]) {
      Omega[1] += P_z[j - 1][1, 1] + P_y[1, y[i, j]];
    }
    for (j in (l[i] + 1):J) {
      Omega = log_prod_exp(P_z[j - 1]', Omega) + P_y[:, 1];
    }
    target += log_sum_exp(Omega);
  }
  
  // log-likelihood for augmented individuals is the same for all
  if (JS) {
    for (k in 1:J) {
      Omega = [ categorical_lpmf(k | gamma) + P_y[1, 1], 
                negative_infinity() ]';
      for (j in (k + 1):J) {
        Omega = log_prod_exp(P_z[j - 1]', Omega)
                + P_y[:, 1];
      }
      lp[k] = log_sum_exp(Omega);
    }
    target += I_aug * log_sum_exp(log1m_psi,
                                  log_psi + log_sum_exp(lp));
  }
}

generated quantities {
  vector[I] log_lik = zeros_vector(I);  // only for observed individuals
  array[JS * (I + I_aug)] int b = zeros_int_array(JS * (I + I_aug));  // entry occasions
  array[JS * J] int N = zeros_int_array(JS * J), B = zeros_int_array(JS * J);  // population sizes and number of entries per survey
  int N_super = I;  // superpopulation (initialised as observed individuals)
  {
    array[I + I_aug, J] int z = rep_array(0, I + I_aug, J);  // latent states (0 = not yet entered, 1 = alive, 2 = dead)
    vector[J - 1] phi_tau = pow(phi, tau);
    vector[J - 1] log_phi = log(phi_tau), log1m_phi = log1m(phi_tau);
    real log_p = log(p), log1m_p = log1m(p), log_psi, log1m_psi;
    if (JS) {
      log_psi = log(psi[1]);
      log1m_psi = log1m(psi[1]);
      log_lik = rep_vector(log_psi, I);
    }
    array[J - 1] matrix[2, 2] P_z;
    for (j in 1:(J - 1)) {
      P_z[j] = [[ log_phi[j], log1m_phi[j] ],
                [ negative_infinity(), 0 ]];
    }
    // emission log TPM: rows = state, columns = y (1 = non-detection, 2 = detection)
    matrix[2, 2] P_y = [[ log1m_p, log_p ],
                        [ 0, negative_infinity() ]];
    vector[J] lp;
    vector[2] theta;
    int w;
    
    // forward algorithm for observed individuals
    for (i in 1:I) {
      matrix[2, J] Omega;
      Omega[:, f[i]] = [ 0, negative_infinity() ]';
      if (JS) {
        for (k in 1:f[i]) {
          lp[k] = categorical_lpmf(k | gamma) + P_y[1, y[i, k]];
          for (j in (k + 1):f[i]) {
            lp[k] += P_z[j - 1][1, 1] + P_y[1, y[i, j]];
          }
        }
        Omega[1, f[i]] = log_sum_exp(lp[1:f[i]]);
        b[i] = latent_rng(lp[1:f[i]]);
      }
      for (j in (f[i] + 1):l[i]) {
        Omega[1, j] = Omega[1, j - 1] + P_z[j - 1][1, 1] + P_y[1, y[i, j]];
      }
      Omega[2, l[i]] = negative_infinity();
      for (j in (l[i] + 1):J) {
        Omega[:, j] = log_prod_exp(P_z[j - 1]', Omega[:, j - 1]) + P_y[:, 1];
      }
      log_lik[i] += log_sum_exp(Omega[:, J]);

      // backward sampling algorithm for latent states
      z[i, J] = latent_rng(Omega[:, J]);
      for (j in (l[i] + 1):(J - 1)) {
        int jj = J + l[i] - j;
        z[i, jj] = latent_rng(Omega[:, jj] + P_z[jj][:, z[i, jj + 1]]);
      }
      int start = JS ? b[i] : f[i];
      z[i, start:l[i]] = rep_array(1, l[i] - start + 1);
    }

    // backward sampling algorithm for augmented individuals
    if (JS) {
      array[J] matrix[2, J] Omega;
      for (k in 1:J) {
        Omega[k][:, k] = [ categorical_lpmf(k | gamma) + P_y[1, 1], 
                           negative_infinity() ]';
        for (j in (k + 1):J) {
          Omega[k][:, j] = log_prod_exp(P_z[j - 1]', Omega[k][:, j - 1])
                           + P_y[:, 1];
        }
        lp[k] = log_sum_exp(Omega[k][:, J]);
      }
      theta = [ log1m_psi, 
                log_psi + log_sum_exp(lp) ]';
      for (i in (I + 1):(I + I_aug)) {
        w = latent_rng(theta) - 1;
        if (w) {
          b[i] = latent_rng(lp);
          z[i, b[i]] = 1;
          if (b[i] < J) {
            z[i, J] = latent_rng(Omega[b[i]][:, J]);
            for (j in (b[i] + 1):(J - 1)) {
              int jj = J + b[i] - j;
              z[i, jj] = latent_rng(Omega[b[i]][:, jj] + P_z[jj][:, z[i, jj + 1]]);
            }
          }
        }
          
        // increment superpopulation if entered and check sizes
        N_super += w;
      }
      if (N_super == I + I_aug) {
        print("N_super == I + I_aug. Increase I_aug and try again.");
      }

      // increment population sizes
      for (i in 1:(I + I_aug)) {
        if (b[i]) {
          B[b[i]] += 1;
          for (j in b[i]:J) {
            N[j] += z[i, j] == 1;
          }
        }
      }
    }
  }
}
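The log_prod_exp function at the top of the Stan program is just a log-space matrix–vector product. A numpy sketch of the same idea, checked against the direct (overflow- and underflow-prone) computation on small values:

```python
import numpy as np

def log_prod_exp(A, B):
    """log(exp(A) @ exp(B)): a log-sum-exp over each row of A + B."""
    M = A + B[np.newaxis, :]             # broadcast B across the rows of A
    m = M.max(axis=1, keepdims=True)     # per-row max for numerical stability
    return m[:, 0] + np.log(np.exp(M - m).sum(axis=1))

rng = np.random.default_rng(1)
A, B = rng.normal(size=(2, 2)), rng.normal(size=2)
assert np.allclose(log_prod_exp(A, B), np.log(np.exp(A) @ np.exp(B)))

# the naive version underflows for very negative log probabilities,
# while the log-sum-exp version stays finite
A2, B2 = np.full((2, 2), -1000.0), np.full(2, -1000.0)
assert np.isfinite(log_prod_exp(A2, B2)).all()
```

This is what lets the forward messages stay on the log scale throughout the model block.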

Since this is a somewhat novel parameterisation of the Jolly-Seber model, I will use simulation-based calibration (SBC) to confirm that we can recover our data-generating process (Talts et al. 2020). The SBC procedure is to take a draw from the prior distributions of the model parameters, generate a prior predictive dataset using these parameter values, estimate the model parameters with MCMC conditioned on the simulated dataset, and repeat this process many times. If the model is “calibrated”, the ranks of the prior draws within the corresponding posterior draws should be uniformly distributed, which we can test with various visual diagnostics implemented in the SBC R package (Modrák et al. 2023). This R package also has useful functions for generating datasets and running the SBC model fits.
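To illustrate the logic of SBC itself, here is a self-contained Python sketch (not the SBC package) using a toy conjugate model (normal prior on a mean, known observation noise) where the exact posterior is available, so ranks can be computed without MCMC. If everything is consistent, the rank of each prior draw among its posterior draws is uniform:

```python
import numpy as np

rng = np.random.default_rng(1)
n_sims, n_draws, n_obs = 2000, 99, 10

ranks = np.empty(n_sims, dtype=int)
for s in range(n_sims):
    mu = rng.normal(0.0, 1.0)              # draw from the prior
    y = rng.normal(mu, 1.0, size=n_obs)    # prior predictive dataset
    # exact conjugate posterior: Normal(mu_n, sd_n)
    sd_n = 1.0 / np.sqrt(1.0 + n_obs)
    mu_n = sd_n ** 2 * y.sum()
    draws = rng.normal(mu_n, sd_n, size=n_draws)
    ranks[s] = (draws < mu).sum()          # rank of the prior draw

# if calibrated, ranks are uniform on {0, ..., n_draws}
counts = np.bincount(ranks, minlength=n_draws + 1)
expected = n_sims / (n_draws + 1)
chisq = ((counts - expected) ** 2 / expected).sum()
assert chisq < 200  # no gross departure from uniformity (99 df)
```

With MCMC the posterior draws replace the exact ones (after thinning to roughly independent draws), which is what the SBC package automates.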

Code
pacman::p_load(SBC, cmdstanr, future, tidyverse, patchwork)

# input
N_super = 200 ; J = 8 ; I_aug = 400 ; X = 2

# load pre-run SBC
sbc_loc <- here::here("blog/sbc.rds")
if (file.exists(sbc_loc)) {
  sbc <- readRDS(sbc_loc) 
  
} else {
  
  # generate datasets
  datasets <- SBC_generator_function(
    function(N_super, J, I_aug, X) {
      
      # survey intervals, covariates, and parameters
      tau <- rlnorm(J - 1)
      x <- matrix(rnorm(J * X), J, X)
      tau_scl <- tau / sum(tau) * J
      phi <- rbeta(1, 2, 2)
      p <- rbeta(1, 2, 2)
      gamma_a <- numeric(2)
      gamma_a[1] <- rexp(1)
      gamma_b <- rnorm(X)
      gamma_z <- rexp(J - 1)
      
      # Dirichlet entry process
      mu <- (x %*% gamma_b)[, 1]
      mu[2:J] <- mu[2:J] + gamma_a[1] * log(tau_scl) + log(gamma_z)
      gamma <- exp(mu) / sum(exp(mu))
      B <- rmultinom(1, N_super, gamma)[, 1]
      b <- rep(1:J, times = B)
      
      # TPMs
      phi_tau <- phi^tau
      P_z <- array(NA, c(J - 1, 2, 2))
      P_z[, 1, 1] <- phi_tau
      P_z[, 1, 2] <- 1 - phi_tau
      P_z[, 2, 1] <- 0
      P_z[, 2, 2] <- 1
      P_y <- matrix(c(1 - p, p,  # alive: detected (y = 2) with probability p
                      1, 0),     # dead: never detected
                    2, 2, byrow = TRUE)
      
      # function for categorical draws
      rcat <- function(n, prob) {
        return(
          rmultinom(n, size = 1, prob) |> 
            apply(2, \(x) which(x == 1))
        )
      }
      
      # simulate
      z <- matrix(NA, N_super, J)
      y <- matrix(1, N_super, J)
      for (i in 1:N_super) {
        z[i, b[i]] <- 1
        if (b[i] < J) {
          for (j in (b[i] + 1):J) {
            z[i, j] <- rcat(1, P_z[j - 1, z[i, j - 1], ])
          }
        }
        for (j in b[i]:J) {
          y[i, j] <- rcat(1, P_y[z[i, j], ])
        }
      }
      
      # first and last capture, and remove never observed individuals
      f <- apply(y, 1, \(x) min(which(x == 2)))
      obs <- which(!is.infinite(f))
      I <- length(obs)
      y <- y[obs, ]
      f <- f[obs]
      l <- apply(y, 1, \(x) max(which(x == 2)))
      
      # out
      list(variables = list(phi = phi, 
                            p = p, 
                            gamma_a = gamma_a,
                            gamma_b = gamma_b,
                            gamma_z = gamma_z,
                            gamma = gamma, 
                            N_super = N_super, 
                            B = B, 
                            N = apply(z, 2, \(x) length(which(x == 1)))),
           generated = list(I = I, 
                            J = J, 
                            f = f, 
                            l = l, 
                            tau = tau, 
                            X = X, 
                            x = x, 
                            y = y, 
                            I_aug = I_aug))
    }, 
    N_super = N_super, J = J, I_aug = I_aug, X = 2) |> 
    generate_datasets(n_sims = 200)
  
  # backend
  backend <- SBC_backend_cmdstan_sample(
    cmdstan_model("blog/cmr.stan"),
    chains = 2, iter_warmup = 200, iter_sampling = 200
  )
  
  # run and save SBC
  plan(multisession, workers = 8)
  sbc <- compute_SBC(datasets, backend, cores_per_fit = 2, keep_fits = F)
  saveRDS(sbc, here::here("blog/sbc.rds"))
}

I’ll investigate the calibration using the plot_ecdf_diff() function. If our estimated posterior matches our data-generating process, the black squiggly lines should lie inside the blue ellipses. For more details, see the rank visualisations vignette on the SBC package website.

Code
sbc$stats |> 
  filter(str_detect(variable, "N\\[|B|gamma\\[")) |> 
  plot_ecdf_diff() + 
  facet_wrap(~ group, scales = "free_y", ncol = J / 2, labeller = label_parsed)

We can also plot our estimates alongside the simulated true values, which shows that we have good coverage.

Code
sbc$stats |> 
  filter(str_detect(variable, "N\\[|B|gamma\\[")) |> 
  plot_sim_estimated() + 
  facet_wrap(~ variable, scales = "free", ncol = J / 2, labeller = label_parsed)

Our recruitment parameters were recovered well alongside survival and detection probabilities.

Code
wrap_plots(
  plot_ecdf_diff(sbc, c("phi", "p", "N_super", "gamma_a[1]", str_c("gamma_b[", 1:X, "]"))) + facet_wrap(~ group, labeller = label_parsed),
  plot_sim_estimated(sbc, c("phi", "p", "N_super", "gamma_a[1]", str_c("gamma_b[", 1:X, "]"))) + facet_wrap(~ variable, labeller = label_parsed, scales = "free"),
  ncol = 1
)

References

Carpenter, B., A. Gelman, M. D. Hoffman, D. Lee, B. Goodrich, M. Betancourt, M. A. Brubaker, J. Guo, P. Li, and A. Riddell. 2017. Stan: A Probabilistic Programming Language. Journal of Statistical Software 76:1.
Gabry, J., R. Češnovar, A. Johnson, and S. Bronder. 2024. cmdstanr: R Interface to ’CmdStan’.
Modrák, M., A. H. Moon, S. Kim, P. Bürkner, N. Huurre, K. Faltejsková, A. Gelman, and A. Vehtari. 2023. Simulation-based calibration checking for Bayesian computation: The choice of test quantities shapes sensitivity. Bayesian Analysis.
Royle, J. A., and R. M. Dorazio. 2012. Parameter-expanded data augmentation for Bayesian analysis of capture–recapture models. Journal of Ornithology 152:521–537.
Talts, S., M. Betancourt, D. Simpson, A. Vehtari, and A. Gelman. 2020, October. Validating Bayesian Inference Algorithms with Simulation-Based Calibration. arXiv.
Wickham, H., M. Averick, J. Bryan, W. Chang, L. D. McGowan, R. François, G. Grolemund, A. Hayes, L. Henry, J. Hester, M. Kuhn, T. L. Pedersen, E. Miller, S. M. Bache, K. Müller, J. Ooms, D. Robinson, D. P. Seidel, V. Spinu, K. Takahashi, D. Vaughan, C. Wilke, K. Woo, and H. Yutani. 2019. Welcome to the tidyverse. Journal of Open Source Software 4:1686.

Footnotes

  1. We cannot distinguish permanent emigration out of the study area from mortality.↩︎

  2. The observation TPM is \(P_y = \begin{bmatrix} 1-p & p \\ 1 & 0 \end{bmatrix}\).↩︎

  3. \(I_\textrm{aug}\) does need to be sufficiently large to ensure \(N_\text{super} < I + I_\textrm{aug}\).↩︎

  4. I’m sure it’s possible with the zeros or ones tricks, but I never used those.↩︎