Individual-based mark-recapture methods, where individual animals are identified through unique markings or through implants like passive integrated transponder (PIT) tags, form the backbone of many demographic studies. The canonical mark-recapture model, the Cormack-Jolly-Seber (CJS) model, estimates apparent¹ individual survival probabilities while accounting for imperfect detection. Population sizes are often of interest as well, and can be derived from mark-recapture models by summing the number of individuals estimated to be alive during a given survey. However, since CJS models condition on an individual's first capture, the population sizes derived from these models always pertain to the subset of the population that was captured at least once during the study. To obtain unbiased population sizes, we also need to account for individuals that were alive in the study area but never captured: this is where Jolly-Seber models come in.
(Cormack-)Jolly-Seber models
I have already blogged about CJS models, but I will briefly recap the assumed data-generating process here. Consider individuals \(i \in 1:I\) that were ever captured during surveys \(j \in 1:J\). For the surveys after an individual's first capture, \((f_i + 1):J\), the multistate formulation describes the survival process with the following transition probability matrix (TPM), where the rows represent the state at survey \(j-1\) (state 1 is alive, state 2 is dead), the columns represent the state at survey \(j\), and \(\tau_j\) is the length of time between consecutive surveys:

\[
P_z = \begin{bmatrix} \phi^{\tau_j} & 1-\phi^{\tau_j} \\ 0 & 1 \end{bmatrix}
\]
Thus, individuals survive a survey interval with probability \(\phi^{\tau_j}\) and cannot come back from the dead. Conditional on an individual being alive during a survey, they are detected with probability \(p\)², yielding an \(I \times J\) detection history matrix \(\boldsymbol{Y}\) (\(y_{ij} = 1\) indicates non-detection, \(y_{ij} = 2\) indicates detection). Survival and detection probabilities can be modeled flexibly at the level of the individual and the survey. When we use the forward algorithm to compute the individual-level log likelihood in Stan (Carpenter et al. 2017), we can use the backward sampling algorithm to recover the \(I \times J\) matrix \(\boldsymbol{Z}\) of latent alive states and compute derived quantities, such as the number of individuals alive during each survey. As noted above, however, the usefulness of these quantities is limited because they provide no information about individuals that were never captured.
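To make the forward pass concrete, here is a minimal R sketch of the CJS likelihood for a single detection history, assuming constant \(\phi\) and \(p\) and unit survey intervals (the function name and example values are mine, not taken from the Stan program):

```r
# Forward algorithm for one detection history, conditioning on first capture.
# States: 1 = alive, 2 = dead; observations: 1 = not detected, 2 = detected.
forward_loglik <- function(y, f, phi, p) {
  J <- length(y)
  P_z <- matrix(c(phi, 1 - phi,
                  0,   1),      2, 2, byrow = TRUE)  # state TPM
  P_y <- matrix(c(1 - p, p,
                  1,     0),    2, 2, byrow = TRUE)  # observation TPM
  lp <- c(0, -Inf)  # alive with probability 1 at first capture
  if (f < J) {
    for (j in (f + 1):J) {
      # log-sum-exp over the previous state, then add the observation term
      lp <- vapply(1:2, function(s) {
        log(sum(exp(lp + log(P_z[, s])))) + log(P_y[s, y[j]])
      }, numeric(1))
    }
  }
  log(sum(exp(lp)))
}

# e.g. an individual first caught at survey 2 and seen again at survey 4
forward_loglik(y = c(1, 2, 1, 2, 1), f = 2, phi = 0.8, p = 0.5)
```

In the Stan program the same recursion runs on the log scale per individual; backward sampling then draws the latent states from the stored forward quantities.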
Jolly-Seber (JS) models provide a remedy by modeling the entry process: how individuals enter the study area before being captured for the first time. By explicitly modeling entry, we can accommodate individuals that were alive in the study area but never captured, and thus obtain an unbiased estimate of population size. Suppose there is a latent “superpopulation” of \(N_\text{super}\) individuals, the total number ever alive in the area during the study period, of which only the subset \(I\) was ever captured. We assume that \(\boldsymbol{B} = B_1, \dots, B_J\) individuals “entered” the study in discrete pulses at each survey with entry probabilities \(\boldsymbol{\gamma} = \gamma_1, \dots, \gamma_J\), so that \(\sum_{j=1}^J B_j = N_\text{super}\) and \(\boldsymbol{B} \sim \textrm{Multinomial} \left( N_\text{super}, \boldsymbol{\gamma}\right)\). Equivalently, each individual's entry occasion \(b_i \sim \textrm{Categorical}\left(\boldsymbol{\gamma}\right)\).
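A quick sanity check of the entry process in R (the numbers are arbitrary):

```r
set.seed(1)
J <- 5
gamma <- c(0.4, 0.2, 0.2, 0.1, 0.1)  # entry probabilities, summing to 1
N_super <- 1000

# entry pulses: B[j] individuals enter at survey j
B <- rmultinom(1, N_super, gamma)[, 1]
b <- rep(1:J, times = B)  # individual-level entry occasions

sum(B) == N_super    # every member of the superpopulation enters exactly once
table(b) / N_super   # empirical entry proportions approximate gamma
```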
Of course, we do not know \(N_\text{super}\), but we can estimate it using data augmentation (Royle and Dorazio 2012). Since the number of observed individuals \(I \leq N_\textrm{super}\), we can augment our observed capture history \(\boldsymbol{Y}\) with a sufficiently large number \(I_\textrm{aug}\) of capture histories containing no detections. Some of these augmented individuals may have been in the study area without ever being captured, whereas others are pseudo-individuals that never existed. By modeling \(N_\text{super} \sim \textrm{Binomial}\left(I + I_\textrm{aug}, \psi\right)\), we can estimate \(N_\text{super}\) along with the entry probabilities and the latent survey-level population sizes. The inclusion probability \(\psi\) is purely a nuisance parameter, because its posterior distribution changes with the somewhat arbitrary choice of \(I_\textrm{aug}\).³
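In code, the augmentation simply pads the detection matrix with rows of non-detections (a sketch with placeholder values; 1 codes non-detection as above):

```r
I <- 150; J <- 8; I_aug <- 400
# placeholder observed detection histories (in practice these come from the field)
y_obs <- matrix(sample(1:2, I * J, replace = TRUE), I, J)
# append I_aug never-detected pseudo-individuals
y_aug <- rbind(y_obs, matrix(1, I_aug, J))
dim(y_aug)  # (I + I_aug) rows enter the model; psi soaks up the padding
```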
Implementation in Stan
When using JAGS or NIMBLE, we would struggle to fit this model: we would need a loop for (j in (b[i] + 1):J), where the b[i]s are discrete parameters, which are not allowed in indexing.⁴ In Stan, discrete parameters have to be marginalised out anyway, so marginalising the entry occasions is the natural approach. Furthermore, in the absence of individual-level effects (e.g., covariates or random effects), all \(I_\textrm{aug}\) unobserved individuals have the same log likelihood, which can be computed once and then multiplied by \(I_\textrm{aug}\). Also note that for observed individuals we only have to marginalise the possible entry occasions up to their first capture. Below is a Stan program that implements the model, including the backward sampling algorithm required to recover the posterior distributions of the latent states, which we use to estimate population sizes. I wrote the model such that if \(I_\textrm{aug} = 0\) it defaults to a CJS model; otherwise the JS model is used. If \(I_\textrm{aug}\) is too small and the estimated \(N_\text{super} = I + I_\textrm{aug}\), warning messages alert the user to increase \(I_\textrm{aug}\).
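To see why a single computation suffices for all augmented individuals, here is an R sketch of the probability that an individual enters at some occasion and is never detected, again assuming constant \(\phi\) and \(p\) and unit intervals (function and variable names are mine):

```r
# u[j]: probability of no detections in surveys j..J, given entry (alive) at j
never_detected_prob <- function(gamma, phi, p) {
  J <- length(gamma)
  u <- numeric(J)
  u[J] <- 1 - p
  if (J > 1) {
    for (j in (J - 1):1) {
      # not detected now, then either survive and stay undetected, or die
      u[j] <- (1 - p) * (phi * u[j + 1] + (1 - phi))
    }
  }
  sum(gamma * u)  # marginalise over the entry occasion
}

never_detected_prob(gamma = rep(1 / 4, 4), phi = 0.8, p = 0.5)
```

This value is identical for every one of the \(I_\textrm{aug}\) all-zero histories, so its log (combined with the inclusion term for \(\psi\)) is computed once and multiplied by \(I_\textrm{aug}\).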
Since the first version of this model, where I simply modeled the entry probabilities as drawn from an uninformative Dirichlet distribution, I have attempted to model the entry probabilities as a function of the time interval between surveys, just like survival. The idea is to model the expected number of entries on the log scale, \(\mu\), where every survey after the first one has an associated time interval (there is no interval before the first survey). I also included survey-level covariates for the entry probabilities, which pertain to each of the \(J\) surveys. Although in Dirichlet regression one of the outcomes is typically fixed to 0 for identifiability, I am not sure this is strictly necessary here; SBC seems to suggest that we can indeed model covariates at the level of each survey. I generate survey-level random variates gamma_z from an exponential distribution, which seems most appropriate because a Dirichlet vector can be constructed as normalised rates generated by \(\textrm{Gamma}(\alpha, 1)\) distributions.
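The normalised-rates construction can be verified numerically: taking the softmax of \(\mu\) is identical to normalising the underlying rates, and normalised \(\textrm{Gamma}(1, 1)\) (i.e., exponential) draws have the \(\textrm{Dirichlet}(\boldsymbol{1})\) marginal means of \(1/J\) (a quick R check with an arbitrary \(J\)):

```r
set.seed(42)
J <- 6
gamma_z <- rexp(J)               # Gamma(1, 1) rates
mu <- log(gamma_z)               # log-scale "expected entries"
gamma <- exp(mu) / sum(exp(mu))  # softmax of mu ...
all.equal(gamma, gamma_z / sum(gamma_z))  # ... equals the normalised rates

# Monte Carlo check: each normalised component has mean 1/J
draws <- t(replicate(20000, { z <- rexp(J); z / sum(z) }))
round(colMeans(draws), 2)  # all close to 1/6
```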
Since this is a somewhat novel parameterisation of the Jolly-Seber model, I will use simulation-based calibration (SBC) to confirm that we can recover our data-generating process (Talts et al. 2020). The SBC procedure is to draw parameter values from the prior distributions, generate a prior predictive dataset using these values, estimate the model parameters with MCMC conditioned on the simulated dataset, and repeat this process many times. If the model is “calibrated”, the ranks of the prior draws within the corresponding posterior draws should be uniformly distributed, which we can test with various visual diagnostics implemented in the SBC R package (Modrák et al. 2023). This package also has useful functions for generating datasets and running the SBC model fits.
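At its core, SBC reduces to a rank statistic. As a toy illustration (my own, not from the SBC package), take a conjugate normal-normal model where the exact posterior is available, so the ranks must be uniform:

```r
set.seed(1)
S <- 99        # posterior draws per fit
n_sims <- 500
ranks <- replicate(n_sims, {
  theta <- rnorm(1)                    # draw from the prior N(0, 1)
  y <- rnorm(1, theta, 1)              # prior predictive data
  post <- rnorm(S, y / 2, sqrt(1 / 2)) # exact posterior: N(y / 2, 1 / 2)
  sum(post < theta)                    # rank of the prior draw, in 0..S
})
hist(ranks, breaks = 20)  # should look flat
```

With MCMC instead of exact posterior draws, deviations from uniformity flag miscalibration, which is exactly what the ECDF-based diagnostics below visualise.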
```r
pacman::p_load(SBC, cmdstanr, future, tidyverse, patchwork)

# input
N_super <- 200; J <- 8; I_aug <- 400; X <- 2

# load pre-run SBC
sbc_loc <- here::here("blog/sbc.rds")
if (file.exists(sbc_loc)) {
  sbc <- readRDS(sbc_loc)
} else {
  # generate datasets
  datasets <- SBC_generator_function(function(N_super, J, I_aug, X) {
    # survey intervals, covariates, and parameters
    tau <- rlnorm(J - 1)
    x <- matrix(rnorm(J * X), J, X)
    tau_scl <- tau / sum(tau) * J
    phi <- rbeta(1, 2, 2)
    p <- rbeta(1, 2, 2)
    gamma_a <- numeric(2)
    gamma_a[1] <- rexp(1)
    gamma_b <- rnorm(X)
    gamma_z <- rexp(J - 1)

    # Dirichlet entry process
    mu <- (x %*% gamma_b)[, 1]
    mu[2:J] <- mu[2:J] + gamma_a[1] * log(tau_scl) + log(gamma_z)
    gamma <- exp(mu) / sum(exp(mu))
    B <- rmultinom(1, N_super, gamma)[, 1]
    b <- rep(1:J, times = B)

    # TPMs (state 1 = alive, state 2 = dead; y = 1 non-detection, y = 2 detection)
    phi_tau <- phi^tau
    P_z <- array(NA, c(J - 1, 2, 2))
    P_z[, 1, 1] <- phi_tau
    P_z[, 1, 2] <- 1 - phi_tau
    P_z[, 2, 1] <- 0
    P_z[, 2, 2] <- 1
    P_y <- matrix(c(1 - p, p,
                    1, 0),
                  2, 2, byrow = TRUE)

    # function for categorical draws
    rcat <- function(n, prob) {
      rmultinom(n, size = 1, prob) |>
        apply(2, \(x) which(x == 1))
    }

    # simulate latent states and detections
    z <- matrix(NA, N_super, J)
    y <- matrix(1, N_super, J)
    for (i in 1:N_super) {
      z[i, b[i]] <- 1
      if (b[i] < J) {
        for (j in (b[i] + 1):J) {
          z[i, j] <- rcat(1, P_z[j - 1, z[i, j - 1], ])
        }
      }
      for (j in b[i]:J) {
        y[i, j] <- rcat(1, P_y[z[i, j], ])
      }
    }

    # first and last capture, and remove never-observed individuals
    f <- apply(y, 1, \(x) min(which(x == 2)))
    obs <- which(!is.infinite(f))
    I <- length(obs)
    y <- y[obs, ]
    f <- f[obs]
    l <- apply(y, 1, \(x) max(which(x == 2)))

    # out
    list(
      variables = list(
        phi = phi, p = p,
        gamma_a = gamma_a, gamma_b = gamma_b, gamma_z = gamma_z,
        gamma = gamma, N_super = N_super, B = B,
        N = apply(z, 2, \(x) length(which(x == 1)))
      ),
      generated = list(
        I = I, J = J, f = f, l = l, tau = tau,
        X = X, x = x, y = y, I_aug = I_aug
      )
    )
  }, N_super = N_super, J = J, I_aug = I_aug, X = 2) |>
    generate_datasets(n_sims = 200)

  # backend
  backend <- SBC_backend_cmdstan_sample(
    cmdstan_model("blog/cmr.stan"),
    chains = 2, iter_warmup = 200, iter_sampling = 200
  )

  # run and save SBC
  plan(multisession, workers = 8)
  sbc <- compute_SBC(datasets, backend, cores_per_fit = 2, keep_fits = FALSE)
  saveRDS(sbc, here::here("blog/sbc.rds"))
}
```
I'll investigate the calibration using the plot_ecdf_diff() function. If our estimated posterior matches our data-generating process, the black squiggly lines should fall inside the blue ellipses. For more details, see the rank visualisations vignette on the SBC package website.
Carpenter, B., A. Gelman, M. D. Hoffman, D. Lee, B. Goodrich, M. Betancourt, M. A. Brubaker, J. Guo, P. Li, and A. Riddell. 2017. Stan: A Probabilistic Programming Language. Journal of Statistical Software 76:1.
Gabry, J., R. Češnovar, A. Johnson, and S. Bronder. 2024. cmdstanr: R Interface to ’CmdStan’.
Modrák, M., A. H. Moon, S. Kim, P. Bürkner, N. Huurre, K. Faltejsková, A. Gelman, and A. Vehtari. 2023. Simulation-Based Calibration Checking for Bayesian Computation: The Choice of Test Quantities Shapes Sensitivity. Bayesian Analysis.
Royle, J. A., and R. M. Dorazio. 2012. Parameter-expanded data augmentation for Bayesian analysis of capture-recapture models. Journal of Ornithology 152:521–537.
Talts, S., M. Betancourt, D. Simpson, A. Vehtari, and A. Gelman. 2020. Validating Bayesian Inference Algorithms with Simulation-Based Calibration. arXiv:1804.06788.
Wickham, H., M. Averick, J. Bryan, W. Chang, L. D. McGowan, R. François, G. Grolemund, A. Hayes, L. Henry, J. Hester, M. Kuhn, T. L. Pedersen, E. Miller, S. M. Bache, K. Müller, J. Ooms, D. Robinson, D. P. Seidel, V. Spinu, K. Takahashi, D. Vaughan, C. Wilke, K. Woo, and H. Yutani. 2019. Welcome to the tidyverse. Journal of Open Source Software 4:1686.
Footnotes
1. We cannot distinguish permanent emigration out of the study area from mortality.
2. The observation TPM is \(P_y = \begin{bmatrix} 1-p & p \\ 1 & 0 \end{bmatrix}\).
3. \(I_\textrm{aug}\) does need to be sufficiently large to ensure \(N_\text{super} < I + I_\textrm{aug}\).
4. I'm sure it's possible with the zeros or ones tricks, but I never used those.