Content from Create a short-term forecast


Last updated on 2024-05-04 | Edit this page

Estimated time: 60 minutes

Overview

Questions

  • How do I create short-term forecasts from case data?
  • How do I account for incomplete reporting in forecasts?

Objectives

  • Learn how to make forecasts of cases using R package EpiNow2
  • Learn how to include an observation process in estimation

Prerequisites

Learners should familiarise themselves with following concept dependencies before working through this tutorial:

Statistics : probability distributions, principle of Bayesian analysis.

Epidemic theory : Effective reproduction number.

Introduction


Given case data of an epidemic, we can create estimates of the current and future number of cases by accounting for both delays in reporting and under-reporting. To make statements about the future of the epidemic, we need to make an assumption about how observations up to the present relate to what we expect to happen in the future. The simplest way of doing so is to assume “no change”, i.e., the reproduction number remains the same in the future as last observed. In this tutorial we will create short-term forecasts by assuming the reproduction number remains at the value estimated on the final date for which data were available.

In this tutorial we are going to learn how to use the EpiNow2 package to forecast cases while accounting for incomplete observations, and to forecast secondary observations such as deaths.

We’ll use the pipe %>% operator to connect functions, so let’s also load the tidyverse package:

R

library(EpiNow2)
library(tidyverse)

Create a short-term forecast


The function epinow() described in the quantifying transmission episode is a wrapper for the functions:

  • estimate_infections() used to estimate cases by date of infection.
  • forecast_infections() used to simulate infections using an existing fit (estimate) to observed cases.

Let’s use the same code used in the quantifying transmission episode to get the input data, delays and priors:

R

# Read cases dataset
cases <- incidence2::covidregionaldataUK %>%
  select(date, cases_new) %>%
  group_by(date) %>%
  summarise(confirm = sum(cases_new, na.rm = TRUE)) %>%
  ungroup()

# Incubation period
incubation_period_fixed <- EpiNow2::dist_spec(
  mean = 4, sd = 2,
  max = 20, distribution = "gamma"
)

# Log-transformed mean
log_mean <- EpiNow2::convert_to_logmean(2, 1)

# Log-transformed std
log_sd <- EpiNow2::convert_to_logsd(2, 1)

# Reporting delay
reporting_delay_fixed <- EpiNow2::dist_spec(
  mean = log_mean, sd = log_sd,
  max = 10, distribution = "lognormal"
)

# Generation time
generation_time_fixed <- EpiNow2::dist_spec(
  mean = 3.6, sd = 3.1,
  max = 20, distribution = "lognormal"
)

# Rt mean prior
rt_log_mean <- EpiNow2::convert_to_logmean(2, 1)

# Rt std prior
rt_log_sd <- EpiNow2::convert_to_logsd(2, 1)

Now we can extract the short-term forecast using:

R

# Assume we only have the first 90 days of this data
reported_cases <- cases[1:90, ]

# Estimate and forecast
estimates <- EpiNow2::epinow(
  reported_cases = reported_cases,
  generation_time = generation_time_opts(generation_time_fixed),
  delays = delay_opts(incubation_period_fixed + reporting_delay_fixed),
  rt = rt_opts(prior = list(mean = rt_log_mean, sd = rt_log_sd))
)

Do not wait for this to complete!

This last chunk may take 10 minutes to run. Keep reading this tutorial episode while this runs in the background. For more information on computing time, read the “Bayesian inference using Stan” section within the quantifying transmission episode.

We can visualise the estimates of the effective reproduction number and the estimated number of cases using plot(). The estimates are split into three categories:

  • Estimate (green): utilises all data,

  • Estimate based on partial data (orange): contains a higher degree of uncertainty because such estimates are based on less data,

  • Forecast (purple): forecasts into the future.

R

plot(estimates)

Forecasting with incomplete observations

In the quantifying transmission episode we accounted for delays in reporting. In EpiNow2 we can also account for incomplete observations since, in reality, not all cases are reported. We will pass another argument into the epinow() function called obs to define an observation model. The format of obs is defined by the obs_opts() function (see ?EpiNow2::obs_opts for more detail).

Let’s say we believe the COVID-19 outbreak data in the cases object do not include all reported cases. We believe that we only observe 40% of cases. To specify this in the observation model, we must pass a scaling factor with a mean and standard deviation. If we assume that 40% of cases are in the case data (with a standard deviation of 1%), then we specify the scale input to obs_opts() as follows:

R

obs_scale <- base::list(mean = 0.4, sd = 0.01)

To run the inference framework with this observation process, we add obs = obs_opts(scale = obs_scale) to the input arguments of epinow():

R

# Define observation model
obs_scale <- base::list(mean = 0.4, sd = 0.01)

# Assume we only have the first 90 days of this data
reported_cases <- cases[1:90, ]

# Estimate and forecast
estimates <- EpiNow2::epinow(
  reported_cases = reported_cases,
  generation_time = generation_time_opts(generation_time_fixed),
  delays = delay_opts(incubation_period_fixed + reporting_delay_fixed),
  rt = rt_opts(prior = list(mean = rt_log_mean, sd = rt_log_sd)),
  # Add observation model
  obs = EpiNow2::obs_opts(scale = obs_scale)
)

OUTPUT

WARN [2024-05-04 02:26:33] epinow: There were 2 divergent transitions after warmup. See
https://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
to find out why this is a problem and how to eliminate them. - 
WARN [2024-05-04 02:26:33] epinow: Examine the pairs() plot to diagnose sampling problems
 - 

R

base::summary(estimates)

OUTPUT

                                 measure                 estimate
                                  <char>                   <char>
1: New confirmed cases by infection date   18016 (10081 -- 30016)
2:        Expected change in daily cases        Likely decreasing
3:            Effective reproduction no.       0.89 (0.57 -- 1.3)
4:                        Rate of growth -0.015 (-0.064 -- 0.034)
5:          Doubling/halving time (days)          -46 (21 -- -11)

The estimates of transmission measures such as the effective reproduction number and rate of growth are similar (or identical) to those obtained when we didn’t account for incomplete observations (see the “Finding estimates” section in the quantifying transmission episode). However, the number of new confirmed cases by infection date has changed substantially in magnitude to reflect the assumption that only 40% of cases are in the data set.

We can also change the default observation distribution from negative binomial to Poisson, remove the default week effect and more. See ?EpiNow2::obs_opts for more details.
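
For instance, a minimal sketch of such changes (argument names family and week_effect as documented in ?EpiNow2::obs_opts; check them against your installed version):

R

# Hedged sketch: Poisson observations, no day-of-week effect,
# keeping the 40% ascertainment scaling defined above
EpiNow2::obs_opts(
  family = "poisson",
  week_effect = FALSE,
  scale = obs_scale
)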

What are the implications of this change?

  • Compare different percentages of observed cases.
  • How do they differ in the estimated number of infections?
  • What are the public health implications of this change?

Forecasting secondary observations


EpiNow2 also has the ability to estimate and forecast secondary observations, e.g., deaths and hospitalisations, from a primary observation, e.g., cases. Here we will illustrate how to forecast the number of deaths arising from observed cases of COVID-19 in the early stages of the UK outbreak.

First, we must format our data to have the following columns:

  • date: the date (as a date object; see ?is.Date()),
  • primary: the number of primary observations on that date, in this example cases,
  • secondary: the number of secondary observations on that date, in this example deaths.

R

reported_cases_deaths <- incidence2::covidregionaldataUK %>%
  select(date, cases_new, deaths_new) %>%
  group_by(date) %>%
  summarise(
    primary = sum(cases_new, na.rm = TRUE),
    secondary = sum(deaths_new, na.rm = TRUE)
  ) %>%
  ungroup()

Distribution of secondary cases (deaths). We will drop the first 30 days with no observed deaths. We will use the deaths between day 31 and day 60 to estimate the secondary observations. We will forecast deaths from day 61 to day 90.

Using the data on cases and deaths between day 31 and day 60, we will estimate the relationship between the primary and secondary observations using estimate_secondary(), then forecast future deaths using forecast_secondary(). For more details on the model see the model documentation.

We must specify the type of observation using type in secondary_opts(), options include:

  • “incidence”: secondary observations arise from previous primary observations, i.e., deaths arising from recorded cases.
  • “prevalence”: secondary observations arise from a combination of current primary observations and past secondary observations, i.e., hospital bed usage arising from current hospital admissions and past hospital bed usage.

In this example we specify secondary_opts(type = "incidence"). See ?EpiNow2::secondary_opts for more detail.

The final key input is the delay distribution between the primary and secondary observations. Here, this is the delay between case report and death; we assume it follows a gamma distribution with a mean of 14 days and a standard deviation of 5 days (alternatively, we can use {epiparameter} to access epidemiological delays). Using dist_spec() we specify a fixed gamma distribution.

There are further function inputs to estimate_secondary() which can be specified, including adding an observation process, see ?EpiNow2::estimate_secondary for detail on these options.

To find the model fit between cases and deaths:

R

# Estimate from day 31 to day 60 of this data
cases_to_estimate <- reported_cases_deaths[31:60, ]

# Estimate secondary cases
estimate_cases_to_deaths <- EpiNow2::estimate_secondary(
  reports = cases_to_estimate,
  secondary = EpiNow2::secondary_opts(type = "incidence"),
  delays = EpiNow2::delay_opts(EpiNow2::dist_spec(
    mean = 14, sd = 5,
    max = 30, distribution = "gamma"
  )
  )
)

Be cautious of time-scale

In the early stages of an outbreak there can be substantial changes in testing and reporting. If there are testing changes from one month to another, then there will be a bias in the model fit. Therefore, you should be cautious of the time-scale of data used in the model fit and forecast.

We plot the model fit (shaded ribbons) with the secondary observations (bar plot) and primary observations (dotted line) as follows:

R

plot(estimate_cases_to_deaths, primary = TRUE)

To use this model fit to forecast deaths, we pass a data frame consisting of the primary observation (cases) for dates not used in the model fit.

Note: in this episode we are using data where we know the deaths and cases, so we create a data frame by extracting the cases. But in practice, this would be a different data set consisting of cases only.

R

# Forecast from day 61 to day 90
cases_to_forecast <- reported_cases_deaths[61:90, c("date", "primary")]
colnames(cases_to_forecast) <- c("date", "value")

To forecast, we use the model fit estimate_cases_to_deaths:

R

# Forecast secondary cases
deaths_forecast <- EpiNow2::forecast_secondary(
  estimate = estimate_cases_to_deaths,
  primary = cases_to_forecast
)

plot(deaths_forecast)

WARNING

Warning: Removed 30 rows containing missing values (`position_stack()`).

The plot shows the forecast secondary observations (deaths) over the dates for which we have recorded cases. It is also possible to forecast deaths using forecast cases; in that case, you would specify primary as the estimates output from estimate_infections().
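
As a sketch of that approach (assuming, as in recent EpiNow2 versions, that forecast_secondary() accepts the output of estimate_infections() as its primary argument, and that epinow() stores that output in the estimates element of its result):

R

# Hedged sketch: forecast deaths from forecast cases rather than observed cases
deaths_from_forecast_cases <- EpiNow2::forecast_secondary(
  estimate = estimate_cases_to_deaths,
  primary = estimates$estimates # infections estimated and forecast by epinow()
)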

Credible intervals

In all EpiNow2 output figures, shaded regions reflect 90%, 50%, and 20% credible intervals in order from lightest to darkest.

Challenge: Ebola outbreak analysis


Challenge

Download the file ebola_cases.csv and read it into R. The simulated data consists of the date of symptom onset and number of confirmed cases of the early stages of the Ebola outbreak in Sierra Leone in 2014.

Using the first 3 months (120 days) of data:

  1. Estimate whether cases are increasing or decreasing on day 120 of the outbreak
  2. Account for a capacity to observe 80% of cases.
  3. Create a two week forecast of number of cases.

You can use the following parameter values for the delay distribution(s) and generation time distribution.

You may include some uncertainty around the mean and standard deviation of these distributions.

We use the effective reproduction number and growth rate to estimate whether cases are increasing or decreasing.

We can use the horizon argument within the epinow() function to extend the time period of the forecast. The default value is seven days.

Ensure the data is in the correct format:

  • date: the date (as a date object; see ?is.Date()),
  • confirm: number of confirmed cases on that date.

To estimate the effective reproduction number and growth rate, we will use the function epinow().

As the data consists of date of symptom onset, we only need to specify a delay distribution for the incubation period and the generation time.

We specify the distributions with some uncertainty around the mean and standard deviation of the lognormal distribution for the incubation period and the gamma distribution for the generation time.

R

ebola_incubation_period <- EpiNow2::dist_spec(
  mean = 2.487, sd = 0.330,
  mean_sd = 0.5, sd_sd = 0.5,
  max = 20, distribution = "lognormal"
)

ebola_generation_time <- EpiNow2::dist_spec(
  mean = 15.3, sd = 10.1,
  mean_sd = 0.5, sd_sd = 0.5,
  max = 30, distribution = "gamma"
)

We read the data input using readr::read_csv(). This function recognises the column date as a <date> class vector.

R

# read data
# e.g.: if path to file is data/raw-data/ebola_cases.csv then:
ebola_cases <- readr::read_csv(
  here::here("data", "raw-data", "ebola_cases.csv")
)

We define an observation model to scale the estimated and forecast number of new infections:

R

# Define observation model
# mean of 80% and standard deviation of 1%
ebola_obs_scale <- list(mean = 0.8, sd = 0.01)

As we want to also create a two week forecast, we specify horizon = 14 to forecast 14 days instead of the default 7 days.

R

ebola_estimates <- EpiNow2::epinow(
  reported_cases = ebola_cases[1:120, ], # first 3 months of data only
  generation_time = generation_time_opts(ebola_generation_time),
  delays = EpiNow2::delay_opts(ebola_incubation_period),
  # Add observation model
  obs = obs_opts(scale = ebola_obs_scale),
  # horizon needs to be 14 days to create two week forecast (default is 7 days)
  horizon = 14
)

OUTPUT

WARN [2024-05-04 02:28:45] epinow: There were 13 divergent transitions after warmup. See
https://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
to find out why this is a problem and how to eliminate them. - 
WARN [2024-05-04 02:28:45] epinow: Examine the pairs() plot to diagnose sampling problems
 - 

R

summary(ebola_estimates)

OUTPUT

                                 measure                estimate
                                  <char>                  <char>
1: New confirmed cases by infection date         126 (57 -- 325)
2:        Expected change in daily cases              Increasing
3:            Effective reproduction no.            1.7 (1 -- 3)
4:                        Rate of growth 0.043 (0.0011 -- 0.092)
5:          Doubling/halving time (days)         16 (7.5 -- 620)

The effective reproduction number \(R_t\) estimate (on the last date of the data) is 1.7 (1 – 3). The exponential growth rate of case numbers is 0.043 (0.0011 – 0.092).

Visualize the estimates:

R

plot(ebola_estimates)

Forecasting with estimates of \(R_t\)

By default, the short-term forecasts are created using the latest estimate of the reproduction number \(R_t\). As this estimate is based on partial data, it has considerable uncertainty.

The reproduction number that is projected into the future can be changed to a less recent estimate based on more data using rt_opts():

R

rt_opts(future = "estimate")

The result will be less uncertain forecasts (as they are based on \(R_t\) with a narrower uncertainty interval) but the forecasts will be based on less recent estimates of \(R_t\) and assume no change since then.

Additionally, there is the option to project the value of \(R_t\) into the future using a generic model by setting future = "project". As this option uses a model to forecast the value of \(R_t\), the result will be forecasts that are more uncertain than with future = "estimate"; for an example see here.
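
As a sketch, this alternative is specified as:

R

# Project Rt forward with a generic model instead of holding it fixed
rt_opts(future = "project")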

Summary


EpiNow2 can be used to create short-term forecasts and to estimate the relationship between different outcomes. There is a range of model options that can be implemented for different analyses, including adding an observation process to account for incomplete reporting. See the vignette for more details on the different model options in EpiNow2 that aren’t covered in these tutorials.

Key Points

  • We can create short-term forecasts by making assumptions about the future behaviour of the reproduction number
  • Incomplete case reporting can be accounted for in estimates

Content from Estimation of outbreak severity


Last updated on 2024-05-04 | Edit this page

Estimated time: 12 minutes

Overview

Questions

  • Why do we estimate the clinical severity of an epidemic?

  • How can the Case Fatality Risk (CFR) be estimated early in an ongoing epidemic?

Objectives

  • Estimate the CFR from aggregated case data using cfr.

  • Estimate a delay-adjusted CFR using {epiparameter} and cfr.

  • Estimate a delay-adjusted severity for an expanding time series using cfr.

Prerequisites

This episode requires you to be familiar with:

Data science: Basic programming with R.

Epidemic theory: Delay distributions.

Introduction


Common questions at the early stage of an epidemic include:

  • What is the likely public health impact of the outbreak in terms of clinical severity?
  • What are the most severely affected groups?
  • Does the outbreak have the potential to cause a very severe pandemic?

We can assess the pandemic potential of an epidemic with two critical measurements: the transmissibility and the clinical severity (Fraser et al., 2009, CDC, 2016).

HHS Pandemic Planning Scenarios based on the Pandemic Severity Assessment Framework: scaled clinical severity (horizontal axis, from 1 = low to 7 = very severe) plotted against scaled transmissibility (vertical axis, from 1 = low to 5 = highly transmissible), with four quadrants labelled, from left to right, “seasonal range”, “moderate pandemic”, “severe pandemic” and “very severe pandemic”. This uses a combined measure of clinical severity and transmissibility to characterise influenza pandemic scenarios. HHS: United States Department of Health and Human Services (CDC, 2016).

One epidemiological approach to estimating the clinical severity is quantifying the Case Fatality Risk (CFR). CFR is the conditional probability of death given confirmed diagnosis, calculated as the cumulative number of deaths from an infectious disease over the number of confirmed diagnosed cases. However, calculating this directly during the course of an epidemic tends to result in a naive or biased CFR given the time delay from onset to death; the estimate varies substantially as the epidemic progresses and stabilises only at the later stages of the outbreak (Ghani et al., 2005).

Observed biased confirmed case fatality risk (CFR) estimates as a function of time (thick line), calculated as the cumulative number of deaths over confirmed cases at time t. The estimate at the end of an outbreak (~May 30) is the realised CFR by the end of the epidemic. The horizontal continuous and dotted lines show the expected value and the 95% confidence intervals (\(95\%\) CI) of the predicted delay-adjusted CFR estimate, using only the observed data until 27 Mar 2003 (Nishiura et al., 2009). Three periods are relevant: days 1 – 15, where the CFR is zero because no deaths have yet been reported; Mar 15 – Apr 26, where the CFR appears to be rising; and Apr 30 – May 30, where the CFR estimate stabilises.

More generally, estimating severity can be helpful even outside of a pandemic planning scenario and in the context of routine public health. Knowing whether an outbreak has or had a different severity from the historical record can motivate causal investigations, which could be intrinsic to the infectious agent (e.g., a new, more severe strain) or due to underlying factors in the population (e.g. reduced immunity or morbidity factors) (Lipsitch et al., 2015).

In this tutorial we are going to learn how to use the cfr package to calculate and adjust a CFR estimate using delay distributions from {epiparameter} or elsewhere, based on the methods developed by Nishiura et al., 2009. We will also see how to reuse cfr functions for further severity measurements.

We’ll use the pipe %>% operator to connect functions, so let’s also load the tidyverse package:

R

library(cfr)
library(epiparameter)
library(tidyverse)
library(outbreaks)

Data sources for clinical severity


What data sources can we use to estimate the clinical severity of a disease outbreak? Verity et al., 2020 summarises the spectrum of COVID-19 cases:

Spectrum of COVID-19 cases. The CFR aims to estimate the proportion of deaths among confirmed cases in an epidemic. (Verity et al., 2020)
  • At the top of the pyramid, those who met the WHO case criteria for severe or critical cases would likely have been identified in the hospital setting, presenting with atypical viral pneumonia. These cases would have been identified in mainland China and among those categorised internationally as local transmission.
  • Many more cases are likely to be symptomatic (i.e., with fever, cough, or myalgia) but might not require hospitalisation. These cases would have been identified through links to international travel to high-risk areas and through contact-tracing of contacts of confirmed cases. They might be identifiable through population surveillance of, for example, influenza-like illness.
  • The bottom part of the pyramid represents mild (and possibly asymptomatic) cases. These cases might be identifiable through contact tracing and subsequently via serological testing.

Naive CFR


We measure disease severity in terms of case fatality risk (CFR). The CFR is interpreted as the conditional probability of death given confirmed diagnosis, calculated as the cumulative number of deaths \(D_{t}\) over the cumulative number of confirmed cases \(C_{t}\) at a certain time \(t\). We can refer to the naive CFR (also crude or biased CFR, \(b_{t}\)):

\[ b_{t} = \frac{D_{t}}{C_{t}} \]

This calculation is naive because it tends to yield a biased and mostly underestimated CFR due to the time-delay from onset to death, only stabilising at the later stages of the outbreak.
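
As a quick illustration of this formula, a minimal base-R sketch (the vectors cases and deaths of daily counts are hypothetical placeholders):

R

# Naive CFR b_t: cumulative deaths over cumulative cases up to each day t
# (days before the first reported case yield NaN)
naive_cfr <- function(cases, deaths) {
  cumsum(deaths) / cumsum(cases)
}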

To calculate the naive CFR, the cfr package requires an input data frame with three columns named:

  • date
  • cases
  • deaths

Let’s explore the ebola1976 dataset, included in {cfr}, which comes from the first Ebola outbreak in what was then called Zaire (now the Democratic Republic of the Congo) in 1976, as analysed by Camacho et al. (2014).

R

# Load the Ebola 1976 data provided with the {cfr} package
data("ebola1976")

# Assume we only have the first 30 days of this data
ebola1976[1:30, ] %>% as_tibble()

OUTPUT

# A tibble: 30 × 3
   date       cases deaths
   <date>     <int>  <int>
 1 1976-08-25     1      0
 2 1976-08-26     0      0
 3 1976-08-27     0      0
 4 1976-08-28     0      0
 5 1976-08-29     0      0
 6 1976-08-30     0      0
 7 1976-08-31     0      0
 8 1976-09-01     1      0
 9 1976-09-02     1      0
10 1976-09-03     1      0
# ℹ 20 more rows

We need aggregated incidence data

cfr reads aggregated incidence data.

This data input should be aggregated by day, which means one observation per day, containing the daily number of reported cases and deaths. Observations with zero or missing values should also be included, similar to time-series data.

Also, cfr currently works for daily data only, but not for other temporal units of data aggregation, e.g., weeks.

When we apply cfr_static() to data directly, we are calculating the naive CFR:

R

# Calculate the naive CFR for the first 30 days
cfr_static(data = ebola1976[1:30, ])

OUTPUT

  severity_mean severity_low severity_high
1     0.4740741    0.3875497     0.5617606

Challenge

Download the file sarscov2_cases_deaths.csv and read it into R.

Estimate the naive CFR.

Inspect the format of the data input.

  • Does it contain daily data?
  • Are the column names as required by cfr_static()?
  • How would you rename column names from a data frame?

We read the data input using readr::read_csv(). This function recognises the column date as a <date> class vector.

R

# read data
# e.g.: if path to file is data/raw-data/sarscov2_cases_deaths.csv then:
sarscov2_input <-
  readr::read_csv(here::here("data", "raw-data", "sarscov2_cases_deaths.csv"))

R

# Inspect data
sarscov2_input

OUTPUT

# A tibble: 93 × 3
   date       cases_jpn deaths_jpn
   <date>         <dbl>      <dbl>
 1 2020-01-20         1          0
 2 2020-01-21         1          0
 3 2020-01-22         0          0
 4 2020-01-23         1          0
 5 2020-01-24         1          0
 6 2020-01-25         3          0
 7 2020-01-26         3          0
 8 2020-01-27         4          0
 9 2020-01-28         6          0
10 2020-01-29         7          0
# ℹ 83 more rows

We can use dplyr::rename() to adapt the external data to fit the data input for cfr_static().

R

# Rename before Estimate naive CFR
sarscov2_input %>%
  dplyr::rename(
    cases = cases_jpn,
    deaths = deaths_jpn
  ) %>%
  cfr_static()

OUTPUT

  severity_mean severity_low severity_high
1    0.01895208   0.01828832    0.01963342

Biases that affect CFR estimation


Two biases that affect CFR estimation

Lipsitch et al., 2015 describe two potential biases that can affect the estimation of CFR (and their potential solutions):

For diseases with a spectrum of clinical presentations, those cases that come to the attention of public health authorities and registered into surveillance databases will typically be people with the most severe symptoms who seek medical care, are admitted to a hospital, or die.

Therefore, the CFR will typically be higher among detected cases than among the entire population of cases, given that the latter may include individuals with mild, subclinical, and (under some definitions of “case”) asymptomatic presentations.

During an ongoing epidemic, there is a delay between the time someone dies and the time their death is reported. Therefore, at any given moment in time, the list of cases includes people who will die and whose death has not yet occurred or has occurred but not yet been reported. Thus, dividing the cumulative number of reported deaths by the cumulative number of reported cases at a specific time point during an outbreak will underestimate the true CFR.

The key determinants of the magnitude of the bias are the epidemic growth rate and the distribution of delays from case-reporting to death-reporting; the longer the delays and the faster the growth rate, the greater the bias.

In this tutorial episode, we are going to focus on solutions to deal with this specific bias using cfr!

Improving an early epidemiological assessment of a delay-adjusted CFR is crucial for determining virulence, shaping the level and choices of public health intervention, and providing advice to the general public.

In 2009, during the influenza A (H1N1) “swine flu” pandemic, Mexico had an early biased estimation of the CFR. Initial reports from the government of Mexico suggested a virulent infection, whereas, in other countries, the same virus was perceived as mild (TIME, 2009).

In the USA and Canada, no deaths were attributed to the virus in the first ten days following the World Health Organization’s declaration of a public health emergency. Even under similar circumstances at the early stage of the global pandemic, public health officials, policymakers and the general public want to know the virulence of an emerging infectious agent.

Fraser et al., 2009 reinterpreted the data, assessing the biases and finding a clinical severity lower than that of the 1918 influenza pandemic but comparable with that seen in the 1957 pandemic.

We can showcase this last bias using the concept described in this {cfr} vignette.

Delay-adjusted CFR


Nishiura et al., 2009 developed a method that considers the time delay from the onset of symptoms to death.

Real-time outbreaks may have a number of deaths that is insufficient to determine the time distribution between onset and death. Therefore, we can estimate the delay distribution from historical outbreaks or reuse those accessible via R packages like {epiparameter} or {epireview}, which collect them from published scientific literature. For a step-by-step guide, read the tutorial episode on how to access epidemiological delays.

Let’s use {epiparameter}:

R

# Get delay distribution
onset_to_death_ebola <-
  epiparameter::epidist_db(
    disease = "Ebola",
    epi_dist = "onset_to_death",
    single_epidist = TRUE
  )

# Plot <epidist> object
plot(onset_to_death_ebola, day_range = 0:40)

To calculate the delay-adjusted CFR, we can use the cfr_static() function with the data and delay_density arguments.

R

# Calculate the delay-adjusted CFR
# for the first 30 days
cfr_static(
  data = ebola1976[1:30, ],
  delay_density = function(x) density(onset_to_death_ebola, x)
)

OUTPUT

  severity_mean severity_low severity_high
1         0.955         0.74             1

The delay-adjusted CFR indicates that the overall disease severity at the end of the outbreak, or with the latest data available at the moment, is 0.955 with a 95% confidence interval between 0.74 and 1, slightly higher than the naive one.

Use the epidist class

When using an <epidist> class object we can use this expression as a template:

function(x) density(<EPIDIST_OBJECT>, x)

For distribution functions with parameters not available in {epiparameter}, we suggest two alternatives:

Challenge

Use the same file from the previous challenge (sarscov2_cases_deaths.csv).

Estimate the delay-adjusted CFR using the appropriate delay distribution. Then:

  • Compare the naive and the delay-adjusted CFR solutions!
  • Find the appropriate <epidist> object!

We use {epiparameter} to access a delay distribution for the SARS-CoV-2 aggregated incidence data:

R

library(epiparameter)

sarscov2_delay <-
  epidist_db(
    disease = "covid",
    epi_dist = "onset to death",
    single_epidist = TRUE
  )

We read the data input using readr::read_csv(). This function recognises the column date as a <date> class vector.

R

# read data
# e.g.: if path to file is data/raw-data/sarscov2_cases_deaths.csv then:
sarscov2_input <-
  readr::read_csv(here::here("data", "raw-data", "sarscov2_cases_deaths.csv"))

R

# Inspect data
sarscov2_input

OUTPUT

# A tibble: 93 × 3
   date       cases_jpn deaths_jpn
   <date>         <dbl>      <dbl>
 1 2020-01-20         1          0
 2 2020-01-21         1          0
 3 2020-01-22         0          0
 4 2020-01-23         1          0
 5 2020-01-24         1          0
 6 2020-01-25         3          0
 7 2020-01-26         3          0
 8 2020-01-27         4          0
 9 2020-01-28         6          0
10 2020-01-29         7          0
# ℹ 83 more rows

We can use dplyr::rename() to adapt the external data to fit the data input for cfr_static().

R

# Rename before Estimate naive CFR
sarscov2_input %>%
  dplyr::rename(
    cases = cases_jpn,
    deaths = deaths_jpn
  ) %>%
  cfr_static(delay_density = function(x) density(sarscov2_delay, x))

OUTPUT

  severity_mean severity_low severity_high
1         0.055        0.054         0.057

Interpret the comparison between the naive and delay-adjusted CFR estimates.

For cfr_static() and all the cfr_*() family of functions, the most appropriate distributions to pass are discrete ones. This is because we will work with daily case and death data.

We can assume that evaluating the Probability Distribution Function (PDF) of a continuous distribution is equivalent to the Probability Mass Function (PMF) of the equivalent discrete distribution.

However, this assumption may not be appropriate for distributions with larger peaks, for instance diseases with an onset-to-death distribution that is strongly peaked with a low variance. In such cases, the average disparity between the PDF and PMF is expected to be more pronounced compared to distributions with broader spreads. One way to deal with this is to discretise the continuous distribution by applying epiparameter::discretise() to an <epidist> object.
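
A minimal sketch of that step, assuming onset_to_death_ebola (accessed above) is a continuous <epidist>:

R

# Discretise the continuous delay distribution before passing it to cfr_*()
onset_to_death_discrete <- epiparameter::discretise(onset_to_death_ebola)

cfr_static(
  data = ebola1976,
  delay_density = function(x) density(onset_to_death_discrete, x)
)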

To adjust the CFR, Nishiura et al., 2009 use the case and death incidence data to estimate the number of cases with known outcomes:

\[ u_t = \dfrac{\sum_{i = 0}^t \sum_{j = 0}^\infty c_{i - j} f_{j}}{\sum_{i = 0} c_i}, \]

where:

  • \(c_{t}\) is the daily case incidence at time \(t\),
  • \(f_{t}\) is the value of the Probability Mass Function (PMF) of the delay distribution between onset and death, and
  • \(u_{t}\) represents the underestimation factor of the known outcomes.

\(u_{t}\) is used to scale the value of the cumulative number of cases in the denominator in the calculation of the CFR. This is calculated internally with the estimate_outcomes() function.
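
A sketch of inspecting \(u_{t}\) directly (assuming estimate_outcomes() is accessible in the {cfr} version you are using):

R

# Estimate the proportion of cases with known outcomes, u_t, for each day
cfr::estimate_outcomes(
  data = ebola1976,
  delay_density = function(x) density(onset_to_death_ebola, x)
) %>%
  utils::tail()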

The estimator for CFR can be written as:

\[p_{t} = \frac{b_{t}}{u_{t}}\]

where \(p_{t}\) is the realized proportion of confirmed cases to die from the infection (or the unbiased CFR), and \(b_{t}\), the crude and biased estimate of CFR (also naive CFR).

From this last equation, we observe that the unbiased CFR \(p_{t}\) is larger than biased CFR \(b_{t}\) because in \(u_{t}\) the numerator is smaller than the denominator (note that \(f_{t}\) is the probability distribution of the delay distribution between onset and death). Therefore, we refer to \(b_{t}\) as the biased estimator of CFR.

When we observe the entire course of an epidemic (from \(t \rightarrow \infty\)), \(u_{t}\) tends to 1, making \(b_{t}\) tend to \(p_{t}\), so it becomes an unbiased estimator (Nishiura et al., 2009).

An early-stage CFR estimate


In the challenge above, we discovered that the naive and delay-adjusted CFR estimates are different.

The naive estimate is useful to get an overall severity estimate of the outbreak (so far). Once the outbreak has ended or has progressed such that more deaths are reported, the estimated CFR is then closest to the ‘true’ unbiased CFR.

On the other hand, the delay-adjusted estimate can assess the severity of an emerging infectious disease earlier than the biased or naive CFR, during an epidemic.

We can explore the early determination of the delay-adjusted CFR using the cfr_rolling() function.

Callout

cfr_rolling() is a utility function that automatically calculates CFR for each day of the outbreak with the data available up to that day, saving the user time.

R

# Calculate the rolling daily naive CFR
# for all the 73 days in the Ebola dataset
rolling_cfr_naive <- cfr::cfr_rolling(data = ebola1976)

utils::tail(rolling_cfr_naive)

OUTPUT

         date severity_mean severity_low severity_high
68 1976-10-31     0.9510204    0.9160061     0.9744387
69 1976-11-01     0.9510204    0.9160061     0.9744387
70 1976-11-02     0.9510204    0.9160061     0.9744387
71 1976-11-03     0.9510204    0.9160061     0.9744387
72 1976-11-04     0.9510204    0.9160061     0.9744387
73 1976-11-05     0.9551020    0.9210866     0.9773771

R

# Calculate the rolling daily delay-adjusted CFR
# for all the 73 days in the Ebola dataset
rolling_cfr_adjusted <- cfr::cfr_rolling(
  data = ebola1976,
  delay_density = function(x) density(onset_to_death_ebola, x)
)

utils::tail(rolling_cfr_adjusted)

OUTPUT

         date severity_mean severity_low severity_high
68 1976-10-31         0.975        0.856             1
69 1976-11-01         0.975        0.856             1
70 1976-11-02         0.971        0.852             1
71 1976-11-03         0.971        0.852             1
72 1976-11-04         0.971        0.852             1
73 1976-11-05         0.971        0.852             1

With utils::tail(), we show the latest CFR estimates. The naive and delay-adjusted estimates have overlapping ranges of 95% confidence intervals.

Now, let’s visualise both results in a time series. How would the naive and delay-adjusted CFR estimates perform in real time?

R

# get the latest delay-adjusted CFR
rolling_cfr_adjusted_end <-
  rolling_cfr_adjusted %>%
  dplyr::slice_tail()

# bind by rows both output data frames
bind_rows(
  rolling_cfr_naive %>%
    mutate(method = "naive"),
  rolling_cfr_adjusted %>%
    mutate(method = "adjusted")
) %>%
  # visualise both adjusted and unadjusted rolling estimates
  ggplot() +
  geom_ribbon(
    aes(
      date,
      ymin = severity_low,
      ymax = severity_high,
      fill = method
    ),
    alpha = 0.2, show.legend = FALSE
  ) +
  geom_line(
    aes(date, severity_mean, colour = method)
  ) +
  geom_hline(
    data = rolling_cfr_adjusted_end,
    aes(yintercept = severity_mean)
  ) +
  geom_hline(
    data = rolling_cfr_adjusted_end,
    aes(yintercept = severity_low),
    lty = 2
  ) +
  geom_hline(
    data = rolling_cfr_adjusted_end,
    aes(yintercept = severity_high),
    lty = 2
  )

The horizontal line represents the delay-adjusted CFR estimated at the outbreak’s end. The dotted lines show its 95% confidence interval (95% CI).

Notice that this delay-adjusted calculation is particularly useful when an epidemic curve of confirmed cases is the only data available (i.e. when individual data from onset to death are unavailable, especially during the early stage of the epidemic). When there are few deaths or none at all, an assumption has to be made for the delay distribution from onset to death, e.g. from literature based on previous outbreaks. Nishiura et al., 2009 depict this in the figures with data from the SARS outbreak in Hong Kong, 2003.

Figures A and B show the cumulative numbers of cases and deaths of SARS, and Figure C shows the observed (biased) CFR estimates as a function of time, i.e. the cumulative number of deaths over cases at time \(t\). Due to the delay from the onset of symptoms to death, the biased estimate of CFR at time \(t\) underestimates the realised CFR at the end of an outbreak (i.e. 302/1755 = 17.2 %).

Observed (biased) confirmed case fatality risk of severe acute respiratory syndrome (SARS) in Hong Kong, 2003. (Nishiura et al., 2009)

Nevertheless, even by only using the observed data for the period March 19 to April 2, cfr_static() can yield an appropriate prediction (Figure D), e.g. the delay-adjusted CFR at March 27 is 18.1 % (95% CI: 10.5, 28.1). An overestimation is seen in the very early stages of the epidemic, but the 95% confidence limits in the later stages include the realised CFR (i.e. 17.2 %).

Early determination of the delay-adjusted confirmed case fatality risk of severe acute respiratory syndrome (SARS) in Hong Kong, 2003. (Nishiura et al., 2009)

Interpret the early-stage CFR estimate

Based on the figure above:

  • How many days apart are the dates at which the 95% CIs of the delay-adjusted CFR and of the naive CFR first cross the CFR estimated at the end of the outbreak?

Discuss:

  • What are the public health policy implications of having a delay-adjusted CFR estimate?

We can use either visual inspection or analysis of the output data frames.

There is almost one month of difference.

Note that the estimate has considerable uncertainty at the beginning of the time series. After two weeks, the delay-adjusted CFR approaches the overall CFR estimate at the outbreak’s end.

Is this pattern similar to other outbreaks? We can use the data sets in this episode’s challenges. We invite you to find it out!

Checklist

With cfr, we estimate the CFR as the proportion of deaths among confirmed cases.

By only using confirmed cases, it is clear that all cases that do not seek medical treatment or are not notified will be missed, as well as all asymptomatic cases. This means that the CFR estimate is higher than the proportion of deaths among the infected.

The cfr method aims to obtain an unbiased estimator “well before” observing the entire course of the outbreak. For this, cfr uses the underestimation factor \(u_{t}\) to estimate the unbiased CFR \(p_{t}\) using maximum-likelihood methods, given the sampling process defined by Nishiura et al., 2009.

The population of confirmed cases and sampling process for estimating the unbiased CFR during the course of an outbreak. (Nishiura et al., 2009)

From aggregated incidence data, at time \(t\) we know the cumulative number of confirmed cases and deaths, \(C_{t}\) and \(D_{t}\), and wish to estimate the unbiased CFR \(\pi\), by way of the factor of underestimation \(u_{t}\).

If we knew the factor of underestimation \(u_{t}\) we could specify the size of the population of confirmed cases no longer at risk (\(u_{t}C_{t}\), shaded), although we do not know which surviving individuals belong to this group. A proportion \(\pi\) of those in the group of cases still at risk (size \((1- u_{t})C_{t}\), unshaded) is expected to die.

Because each case no longer at risk had an independent probability of dying, \(\pi\), the number of deaths, \(D_{t}\), is a sample from a binomial distribution with sample size \(u_{t}C_{t}\), and probability of dying \(p_{t}\) = \(\pi\).

This is represented by the following log-likelihood function, used to obtain the maximum likelihood estimate of the unbiased CFR \(p_{t}\) = \(\pi\):

\[ {\sf L}(\pi | C_{t},D_{t},u_{t}) = \log{\dbinom{u_{t}C_{t}}{D_{t}}} + D_{t} \log{\pi} + (u_{t}C_{t} - D_{t})\log{(1 - \pi)}, \]

This estimation is performed by the internal function cfr:::estimate_severity().
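
To make this step concrete, a minimal base-R sketch of maximising the log-likelihood above over \(\pi\) (the values of u_t, C_t and D_t are hypothetical):

R

# Log-likelihood of pi given u_t, C_t and D_t at a fixed time t
loglik <- function(pi, u_t, C_t, D_t) {
  lchoose(round(u_t * C_t), D_t) +
    D_t * log(pi) +
    (u_t * C_t - D_t) * log(1 - pi)
}

# Maximum likelihood estimate of pi (hypothetical inputs)
stats::optimise(
  loglik,
  interval = c(1e-6, 1 - 1e-6),
  u_t = 0.9, C_t = 300, D_t = 250,
  maximum = TRUE
)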

  • The delay-adjusted CFR does not address all sources of error in the data, such as the underdiagnosis of infected individuals.

Challenges


Aggregated data differ from linelists

Aggregated incidence data differs from linelist data, where each observation contains individual-level data.

R

outbreaks::ebola_sierraleone_2014 %>% as_tibble()

OUTPUT

# A tibble: 11,903 × 8
      id   age sex   status    date_of_onset date_of_sample district chiefdom   
   <int> <dbl> <fct> <fct>     <date>        <date>         <fct>    <fct>      
 1     1    20 F     confirmed 2014-05-18    2014-05-23     Kailahun Kissi Teng 
 2     2    42 F     confirmed 2014-05-20    2014-05-25     Kailahun Kissi Teng 
 3     3    45 F     confirmed 2014-05-20    2014-05-25     Kailahun Kissi Tonge
 4     4    15 F     confirmed 2014-05-21    2014-05-26     Kailahun Kissi Teng 
 5     5    19 F     confirmed 2014-05-21    2014-05-26     Kailahun Kissi Teng 
 6     6    55 F     confirmed 2014-05-21    2014-05-26     Kailahun Kissi Teng 
 7     7    50 F     confirmed 2014-05-21    2014-05-26     Kailahun Kissi Teng 
 8     8     8 F     confirmed 2014-05-22    2014-05-27     Kailahun Kissi Teng 
 9     9    54 F     confirmed 2014-05-22    2014-05-27     Kailahun Kissi Teng 
10    10    57 F     confirmed 2014-05-22    2014-05-27     Kailahun Kissi Teng 
# ℹ 11,893 more rows

Use incidence2 to rearrange your data

From the outbreaks package, load the MERS linelist of cases from the mers_korea_2015 object.

Rearrange this linelist to fit the cfr package input format.

Estimate the delay-adjusted HFR using the corresponding delay distribution.

How to rearrange my input data?

Rearranging the input data for data analysis can take most of the time. To get ready-to-analyse aggregated incidence data, we encourage you to use incidence2!

First, in the Get started vignette from the incidence2 package, explore how to use the date_index argument when reading a linelist with dates in multiple columns.

Then, refer to the cfr vignette on Handling data from {incidence2} on how to use the cfr::prepare_data() function with incidence2 objects.

R

# Load packages
library(cfr)
library(epiparameter)
library(incidence2)
library(outbreaks)
library(tidyverse)

# Access delay distribution
mers_delay <-
  epidist_db(
    disease = "mers",
    epi_dist = "onset to death",
    single_epidist = TRUE
  )

# Read linelist
mers_korea_2015$linelist %>%
  as_tibble() %>%
  select(starts_with("dt_"))

OUTPUT

# A tibble: 162 × 6
   dt_onset   dt_report  dt_start_exp dt_end_exp dt_diag    dt_death  
   <date>     <date>     <date>       <date>     <date>     <date>    
 1 2015-05-11 2015-05-19 2015-04-18   2015-05-04 2015-05-20 NA        
 2 2015-05-18 2015-05-20 2015-05-15   2015-05-20 2015-05-20 NA        
 3 2015-05-20 2015-05-20 2015-05-16   2015-05-16 2015-05-21 2015-06-04
 4 2015-05-25 2015-05-26 2015-05-16   2015-05-20 2015-05-26 NA        
 5 2015-05-25 2015-05-27 2015-05-17   2015-05-17 2015-05-26 NA        
 6 2015-05-24 2015-05-28 2015-05-15   2015-05-17 2015-05-28 2015-06-01
 7 2015-05-21 2015-05-28 2015-05-16   2015-05-17 2015-05-28 NA        
 8 2015-05-26 2015-05-29 2015-05-15   2015-05-15 2015-05-29 NA        
 9 NA         2015-05-29 2015-05-15   2015-05-17 2015-05-29 NA        
10 2015-05-21 2015-05-29 2015-05-16   2015-05-16 2015-05-29 NA        
# ℹ 152 more rows

R

# Use {incidence2} to count daily incidence
mers_incidence <- mers_korea_2015$linelist %>%
  # convert to an incidence2 object
  incidence(date_index = c("dt_onset", "dt_death")) %>%
  # complete dates from first to last
  incidence2::complete_dates()

# Inspect incidence2 output
mers_incidence

OUTPUT

# incidence:  72 x 3
# count vars: dt_death, dt_onset
   date_index count_variable count
   <date>     <chr>          <int>
 1 2015-05-11 dt_death           0
 2 2015-05-11 dt_onset           1
 3 2015-05-12 dt_death           0
 4 2015-05-12 dt_onset           0
 5 2015-05-13 dt_death           0
 6 2015-05-13 dt_onset           0
 7 2015-05-14 dt_death           0
 8 2015-05-14 dt_onset           0
 9 2015-05-15 dt_death           0
10 2015-05-15 dt_onset           0
# ℹ 62 more rows

R

# Prepare data from {incidence2} to {cfr}
mers_incidence %>%
  prepare_data(
    cases_variable = "dt_onset",
    deaths_variable = "dt_death"
  )

OUTPUT

         date deaths cases
1  2015-05-11      0     1
2  2015-05-12      0     0
3  2015-05-13      0     0
4  2015-05-14      0     0
5  2015-05-15      0     0
6  2015-05-16      0     0
7  2015-05-17      0     1
8  2015-05-18      0     1
9  2015-05-19      0     0
10 2015-05-20      0     5
11 2015-05-21      0     6
12 2015-05-22      0     2
13 2015-05-23      0     4
14 2015-05-24      0     2
15 2015-05-25      0     3
16 2015-05-26      0     1
17 2015-05-27      0     2
18 2015-05-28      0     1
19 2015-05-29      0     3
20 2015-05-30      0     5
21 2015-05-31      0    10
22 2015-06-01      2    16
23 2015-06-02      0    11
24 2015-06-03      1     7
25 2015-06-04      1    12
26 2015-06-05      1     9
27 2015-06-06      0     7
28 2015-06-07      0     7
29 2015-06-08      2     6
30 2015-06-09      0     1
31 2015-06-10      2     6
32 2015-06-11      1     3
33 2015-06-12      0     0
34 2015-06-13      0     2
35 2015-06-14      0     0
36 2015-06-15      0     1

R

# Estimate delay-adjusted CFR
mers_incidence %>%
  prepare_data(
    cases_variable = "dt_onset",
    deaths_variable = "dt_death"
  ) %>%
  cfr_static(delay_density = function(x) density(mers_delay, x))

OUTPUT

  severity_mean severity_low severity_high
1         0.137        0.069          0.24

Severity heterogeneity

The CFR may differ across populations (e.g. age, space, treatment); quantifying these heterogeneities can help target resources appropriately and compare different care regimens (Cori et al., 2017).

Use the cfr::covid_data data frame to estimate a delay-adjusted CFR stratified by country.

One way to do a stratified analysis is to apply a model to nested data. This {tidyr} vignette shows you how to apply the group_by() + nest() to nest data, and then mutate() + map() to apply the model.

R

library(cfr)
library(epiparameter)
library(tidyverse)

covid_data %>% glimpse()

OUTPUT

Rows: 20,786
Columns: 4
$ date    <date> 2020-01-03, 2020-01-03, 2020-01-03, 2020-01-03, 2020-01-03, 2…
$ country <chr> "Argentina", "Brazil", "Colombia", "France", "Germany", "India…
$ cases   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ deaths  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…

R

delay_onset_death <-
  epidist_db(
    disease = "covid",
    epi_dist = "onset to death",
    single_epidist = TRUE
  )

covid_data %>%
  group_by(country) %>%
  nest() %>%
  mutate(
    temp =
      map(
        .x = data,
        .f = cfr::cfr_static,
        delay_density = function(x) density(delay_onset_death, x)
      )
  ) %>%
  unnest(cols = temp)

OUTPUT

# A tibble: 19 × 5
# Groups:   country [19]
   country        data                 severity_mean severity_low severity_high
   <chr>          <list>                       <dbl>        <dbl>         <dbl>
 1 Argentina      <tibble [1,094 × 3]>         0.013        0.013         0.013
 2 Brazil         <tibble [1,094 × 3]>         0.019        0.019         0.019
 3 Colombia       <tibble [1,094 × 3]>         0.022        0.022         0.022
 4 France         <tibble [1,094 × 3]>         0.004        0.004         0.004
 5 Germany        <tibble [1,094 × 3]>         0.005        0.005         0.005
 6 India          <tibble [1,094 × 3]>         0.012        0.012         0.012
 7 Indonesia      <tibble [1,094 × 3]>         0.024        0.024         0.024
 8 Iran           <tibble [1,094 × 3]>         0.019        0.019         0.019
 9 Italy          <tibble [1,094 × 3]>         0.007        0.007         0.007
10 Mexico         <tibble [1,094 × 3]>         0.046        0.046         0.046
11 Peru           <tibble [1,094 × 3]>         0.05         0.05          0.05 
12 Poland         <tibble [1,094 × 3]>         0.019        0.019         0.019
13 Russia         <tibble [1,094 × 3]>         0.018        0.018         0.018
14 South Africa   <tibble [1,094 × 3]>         0.025        0.025         0.025
15 Spain          <tibble [1,094 × 3]>         0.009        0.009         0.009
16 Turkey         <tibble [1,094 × 3]>         0.006        0.006         0.006
17 Ukraine        <tibble [1,094 × 3]>         0.02         0.02          0.02 
18 United Kingdom <tibble [1,094 × 3]>         0.009        0.009         0.009
19 United States  <tibble [1,094 × 3]>         0.011        0.011         0.011

Great! Now you can use similar code for any other stratified analysis, such as by age, region, or more!

But how can we interpret the variability in severity across countries for the same diagnosed pathogen?

Local factors like testing capacity, the case definition, and the sampling regime can affect the reporting of cases and deaths, thus affecting case ascertainment. Take a look at the cfr vignette on Estimating the proportion of cases that are ascertained during an outbreak!

Appendix


The cfr package has a function called cfr_time_varying() with functionality that differs from cfr_rolling().

When to use cfr_rolling()?

cfr_rolling() shows the estimated CFR on each outbreak day, given that future data on cases and deaths is unavailable at the time. The final value of cfr_rolling() estimates is identical to cfr_static() on the same data.

Remember, as shown above, cfr_rolling() is helpful to get early-stage CFR estimates and check whether an outbreak’s CFR estimate has stabilised. Thus, cfr_rolling() is not sensitive to the length or size of the epidemic.
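
As a quick check of this property, using the ebola1976 data from earlier:

R

# The last cfr_rolling() estimate should match cfr_static() on the same data
utils::tail(cfr::cfr_rolling(data = ebola1976), 1)
cfr::cfr_static(data = ebola1976)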

When to use cfr_time_varying()?

On the other hand, cfr_time_varying() calculates the CFR over a moving window and helps to understand changes in CFR due to changes in the epidemic, e.g. due to a new variant or increased immunity from vaccination.

However, cfr_time_varying() is sensitive to sampling uncertainty. Thus, it is sensitive to the size of the outbreak. The higher the number of cases with expected outcomes on a given day, the more reasonable estimates of the time-varying CFR we will get.

For example, with 100 cases, the fatality risk estimate will, roughly speaking, have a 95% confidence interval of ±10% around the mean estimate (binomial CI). So if we have >100 cases with expected outcomes on a given day, we can get reasonable estimates of the time-varying CFR. But if we only have around 100 cases over the course of the whole epidemic, we probably need to rely on cfr_rolling(), which uses the cumulative data.
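
A quick numerical check of this rule of thumb (20 deaths among 100 cases with expected outcomes is a hypothetical example):

R

# 95% binomial CI for a fatality risk of 20/100: roughly 0.13 -- 0.29,
# i.e. about +/- 8% around the mean estimate of 0.20
stats::binom.test(x = 20, n = 100)$conf.int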

We invite you to read this vignette about the cfr_time_varying() function.

More severity measures

Suppose we need to assess the clinical severity of the epidemic in a context different from surveillance data, like the severity among cases that arrive at hospitals or cases you collected from a representative serological survey.

Using cfr, we can change the inputs for the numerator (deaths) and denominator (cases) to estimate more severity measures like the Infection Fatality Risk (IFR) or the Hospitalisation Fatality Risk (HFR). We can follow this analogy:

If for a Case fatality risk (CFR), we require:

  • case and death incidence data, with a
  • case-to-death delay distribution (or close approximation, such as symptom onset-to-death).

Then, the Infection fatality risk (IFR) requires:

  • infection and death incidence data, with an
  • exposure-to-death delay distribution (or close approximation).

Similarly, the Hospitalisation Fatality Risk (HFR) requires (see the sketch after this list):

  • hospitalisation and death incidence data, and a
  • hospitalisation-to-death delay distribution.
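
As a sketch of this analogy for the HFR (hosp_data and hosp_to_death are hypothetical: a data frame with columns date, hospitalisations and deaths, and a hospitalisation-to-death delay distribution):

R

# Hedged sketch: reuse cfr_static() with hospitalisations in the "cases" column
hosp_data %>% # hypothetical input data
  dplyr::rename(cases = hospitalisations) %>%
  cfr::cfr_static(
    delay_density = function(x) density(hosp_to_death, x) # hypothetical delay
  )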

Yang et al., 2020 summarises different definitions and data sources:

Severity levels of infections with SARS-CoV-2 and parameters of interest. Each level is assumed to be a subset of the level below.
  • sCFR symptomatic case-fatality risk,
  • sCHR symptomatic case-hospitalisation risk,
  • mCFR medically attended case-fatality risk,
  • mCHR medically attended case-hospitalisation risk,
  • HFR hospitalisation-fatality risk.
Data sources for COVID-19 cases in Wuhan: D1) 32,583 laboratory-confirmed COVID-19 cases as of March 8; D2) 17,365 clinically-diagnosed COVID-19 cases during February 9 – 19; D3) daily number of laboratory-confirmed cases on March 9 – April 24; D4) total number of COVID-19 deaths as of April 24 obtained from the Hubei Health Commission; D5) 325 laboratory-confirmed cases and D6) 1,290 deaths added as of April 16 through a comprehensive and systematic verification by Wuhan authorities; and D7) 16,781 laboratory-confirmed cases identified through universal screening. Pse: RT-PCR sensitivity. Pmed.care: proportion seeking medical assistance among patients suffering from acute respiratory infections.
Schematic diagram of the baseline analyses. Red, blue, and green arrows denote the data flow from laboratory-confirmed cases of passive surveillance, clinically-diagnosed cases, and laboratory-confirmed cases of active screenings.

Discussion

Based on your experience:

  • Share any previous outbreak response in which you participated.

Answer these questions:

  • How did you assess the clinical severity of the outbreak?
  • What were the primary sources of bias?
  • What did you do to take into account the identified bias?
  • What complementary analysis would you do to solve the bias?

Key Points

  • Use cfr to estimate severity.

  • Use cfr_static() to estimate the overall CFR with the latest data available.

  • Use cfr_rolling() to show what the estimated CFR would be on each day of the outbreak.

  • Use the delay_density argument to adjust the CFR by the corresponding delay distribution.

Content from Account for superspreading


Last updated on 2024-05-01 | Edit this page

Estimated time: 32 minutes

Overview

Questions

  • How can we estimate individual-level variation in transmission (i.e. superspreading potential) from contact tracing data?
  • What are the implications for variation in transmission for decision-making?

Objectives

  • Estimate the distribution of onward transmission from infected individuals (i.e. offspring distribution) from outbreak data using epicontacts.
  • Estimate the extent of individual-level variation (i.e. the dispersion parameter) of the offspring distribution using fitdistrplus.
  • Estimate the proportion of transmission that is linked to ‘superspreading events’ using {superspreading}.

Prerequisites

Learners should familiarise themselves with the following concept dependencies before working through this tutorial:

Statistics: common probability distributions, particularly Poisson and negative binomial.

Epidemic theory: The reproduction number, R.

Introduction


From smallpox to severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), some infected individuals spread infection to more people than others. Disease transmission is the result of a combination of biological and social factors, and these factors average out to some extent at the population level during a large epidemic. Hence researchers often use population averages to assess the potential for disease to spread. However, in the earlier or later phases of an outbreak, individual differences in infectiousness can be more important. In particular, they increase the chance of superspreading events (SSEs), which can ignite explosive epidemics and also influence the chances of controlling transmission (Lloyd-Smith et al., 2005).

Chains of SARS-CoV-2 transmission in Hong Kong initiated by local or imported cases. (a), Transmission network of a cluster of cases traced back to a collection of four bars across Hong Kong (n = 106). (b), Transmission network associated with a wedding without clear infector–infectee pairs but linked back to a preceding social gathering and local source (n = 22). (c), Transmission network associated with a temple cluster of undetermined source (n = 19). (d), All other clusters of SARS-CoV-2 infections where the source and transmission chain could be determined (Adam et al., 2020).

The basic reproduction number, \(R_{0}\), measures the average number of cases caused by one infectious individual in an entirely susceptible population. Estimates of \(R_{0}\) are useful for understanding the average dynamics of an epidemic at the population level, but can obscure considerable individual variation in infectiousness. This was highlighted during the global emergence of SARS-CoV-2 by numerous ‘superspreading events’ in which certain infectious individuals generated unusually large numbers of secondary cases (LeClerc et al, 2020).

Observed offspring distribution of SARS-CoV-2 transmission in Hong Kong (R = 0.58 and k = 0.43). N = 91 SARS-CoV-2 infectors, N = 153 terminal infectees and N = 46 sporadic local cases. Histogram bars indicate the proportion of onward transmission per number of secondary cases. Line corresponds to a fitted negative binomial distribution (Adam et al., 2020).

In this tutorial, we are going to quantify individual variation in transmission, and hence estimate the potential for superspreading events. Then we are going to use these estimates to explore the implications of superspreading for contact tracing interventions.

We are going to use data from the outbreaks package, manage the linelist and contacts data using epicontacts, and estimate distribution parameters with fitdistrplus. Lastly, we are going to use {superspreading} to explore the implications of variation in transmission for decision-making.

We’ll use the pipe %>% to connect some of the functions from these packages, so let’s also call the tidyverse package.

R

library(outbreaks)
library(epicontacts)
library(fitdistrplus)
library(superspreading)
library(tidyverse)

The double-colon

The double-colon :: in R lets you call a specific function from a package without loading the entire package into the current environment.

For example, dplyr::filter(data, condition) uses filter() from the dplyr package.

This helps us remember package functions and avoid namespace conflicts.

The individual reproduction number


The individual reproduction number is defined as the number of secondary cases caused by a particular infected individual.

Early in an outbreak we can use contact data to reconstruct transmission chains (i.e. who infected whom) and calculate the number of secondary cases generated by each individual. This reconstruction of linked transmission events from contact data can provide an understanding about how different individuals have contributed to transmission during an epidemic (Cori et al., 2017).

Let’s practice this using the mers_korea_2015 linelist and contact data from the outbreaks package and integrate them with the epicontacts package to calculate the distribution of secondary cases during the 2015 MERS-CoV outbreak in South Korea (Campbell, 2022):

R

## first, make an epicontacts object
epi_contacts <-
  epicontacts::make_epicontacts(
    linelist = outbreaks::mers_korea_2015$linelist,
    contacts = outbreaks::mers_korea_2015$contacts
  )

R

# visualise contact network
epicontacts::vis_epicontacts(epi_contacts)

Contact data from a transmission chain can provide information on which infected individuals came into contact with others. We expect to have the infector (from) and the infectee (to) plus additional columns of variables related to their contact, such as location (exposure) and date of contact.

Following tidy data principles, the observation unit in our contact dataset is the infector-infectee pair. Although one infector can infect multiple infectees, contact tracing investigations may record contacts linked to more than one infector (e.g. within a household). Nevertheless, we should expect unique infector-infectee pairs, because each infected person will typically have acquired the infection from a single other person.

To ensure these unique pairs, we can check for replicated infectees:

R

# check for infectees that appear in more than one infector-infectee pair
epi_contacts %>%
  pluck("contacts") %>%
  group_by(to) %>%
  filter(n() > 1)

OUTPUT

# A tibble: 5 × 4
# Groups:   to [2]
  from  to     exposure       diff_dt_onset
  <chr> <chr>  <fct>                  <int>
1 SK_16 SK_107 Emergency room            17
2 SK_87 SK_107 Emergency room             2
3 SK_14 SK_39  Hospital room             16
4 SK_11 SK_39  Hospital room             13
5 SK_12 SK_39  Hospital room             12

When each infector-infectee row is unique, the number of entries per infector corresponds to the number of secondary cases generated by that individual.

R

# count secondary cases per infector
infector_secondary <- epi_contacts %>%
  pluck("contacts") %>%
  count(from, name = "secondary_cases")

But this output only contains the number of secondary cases for reported infectors, not for every individual in the whole epicontacts object.

To get this, first, we can use epicontacts::get_id() to get the full list of unique identifiers (“id”) from the epicontacts class object. Second, we join it with the counts of secondary cases per infector stored in the infector_secondary object. Third, we replace the missing values with 0, indicating individuals with no reported secondary cases.

R

all_secondary <- epi_contacts %>%
  # extract ids in contact *and* linelist using "which" argument
  epicontacts::get_id(which = "all") %>%
  # transform vector to dataframe to use left_join()
  enframe(name = NULL, value = "from") %>%
  # join counts of secondary cases per infector
  left_join(infector_secondary, by = "from") %>%
  # individuals with no recorded secondary cases are replaced with zero
  replace_na(
    replace = list(secondary_cases = 0)
  )

From a histogram of the all_secondary object, we can see the individual-level variation in the number of secondary cases. Three individuals generated more than 20 secondary cases each, while the remaining individuals generated fewer than five, or none at all.

R

## plot the distribution
all_secondary %>%
  ggplot(aes(secondary_cases)) +
  geom_histogram(binwidth = 1) +
  labs(
    x = "Number of secondary cases",
    y = "Frequency"
  )

The number of secondary cases can be used to empirically estimate the offspring distribution, that is, the distribution of the number of secondary infections caused by each case. One candidate statistical distribution used to model the offspring distribution is the negative binomial distribution, with two parameters:

  • Mean, which represents the \(R_{0}\), the average number of (secondary) cases produced by a single individual in an entirely susceptible population, and

  • Dispersion, expressed as \(k\), which represents the individual-level variation in transmission by single individuals.

From the histogram and density plot, we can see that the offspring distribution is highly skewed, or overdispersed. In this framework, superspreading events (SSEs) are not arbitrary or exceptional, but simply realisations from the right-hand tail of the offspring distribution, which we can quantify and analyse (Lloyd-Smith et al., 2005).

Terminology recap

  • From linelist and contact data, we calculate the number of secondary cases caused by the observed infected individuals.
  • Whereas \(R_{0}\) captures the average transmission in the population, we can define the individual reproduction number as a random variable representing the expected number of secondary cases caused by an infected individual.
  • From the stochastic effects in transmission, the number of secondary infections caused by each case is described by an offspring distribution.
  • An empirical offspring distribution can be modeled by the negative-binomial distribution with mean \(R_{0}\) and dispersion parameter \(k\).

To model counts of discrete events, we can use the Poisson or negative binomial distributions.

In a Poisson distribution, the mean is equal to the variance. When the variance is higher than the mean, this is called overdispersion. Overdispersion is common in biological applications, so a negative binomial distribution may be worth considering as an alternative to the Poisson distribution.

The negative binomial distribution is especially useful for discrete data over an unbounded positive range whose sample variance exceeds the sample mean. In such cases, the observations are overdispersed with respect to a Poisson distribution, for which the mean equals the variance.

In epidemiology, negative binomial distributions have been used to model disease transmission for infectious diseases where the likely number of onward infections may vary considerably from individual to individual and from setting to setting. This captures the variation in individuals’ infection histories, including biological factors (e.g. degree of viral shedding) and environmental circumstances (e.g. type and location of contact).

Challenge

Calculate the distribution of secondary cases for Ebola using the ebola_sim_clean object from the outbreaks package.

Is the offspring distribution of Ebola skewed or overdispersed?

Note: This dataset has 5829 cases. Running epicontacts::vis_epicontacts() may overload your session!

R

## first, make an epicontacts object
ebola_contacts <-
  epicontacts::make_epicontacts(
    linelist = ebola_sim_clean$linelist,
    contacts = ebola_sim_clean$contacts
  )

# count secondary cases

ebola_infector_secondary <- ebola_contacts %>%
  pluck("contacts") %>%
  count(from, name = "secondary_cases")

ebola_secondary <- ebola_contacts %>%
  # extract ids in contact *and* linelist using "which" argument
  epicontacts::get_id(which = "all") %>%
  # transform vector to dataframe to use left_join()
  enframe(name = NULL, value = "from") %>%
  # join counts of secondary cases per infector
  left_join(ebola_infector_secondary, by = "from") %>%
  # individuals with no recorded secondary cases are replaced with zero
  replace_na(
    replace = list(secondary_cases = 0)
  )

## plot the distribution
ebola_secondary %>%
  ggplot(aes(secondary_cases)) +
  geom_histogram(binwidth = 1) +
  labs(
    x = "Number of secondary cases",
    y = "Frequency"
  )

From a visual inspection, the distribution of secondary cases for the Ebola data set in ebola_sim_clean shows a skewed distribution, with all individuals generating six or fewer secondary cases. We need to complement this observation with a statistical analysis to evaluate overdispersion.

Estimate the dispersion parameter


To empirically estimate the dispersion parameter \(k\), we could fit a negative binomial distribution to the number of secondary cases.

We can fit distributions to data using the fitdistrplus package, which provides maximum likelihood estimates.

R

library(fitdistrplus)

## fit distribution
offspring_fit <- all_secondary %>%
  pull(secondary_cases) %>%
  fitdist(distr = "nbinom")

offspring_fit

OUTPUT

Fitting of the distribution ' nbinom ' by maximum likelihood 
Parameters:
       estimate  Std. Error
size 0.02039807 0.007278299
mu   0.60452947 0.337893199

Name of parameters

From the fitdistrplus output:

  • The size object refers to the estimated dispersion parameter \(k\), and
  • The mu object refers to the estimated mean, which represents \(R_{0}\).

From the distribution of the number of secondary cases, we estimated a dispersion parameter \(k\) of 0.02, with a 95% confidence interval from 0.006 to 0.035. As the value of \(k\) is significantly lower than one, we can conclude that there is considerable potential for superspreading events.
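These interval bounds come from the fit’s standard error and normal quantiles, in the same way as we will calculate them for Ebola later in this episode:

R

## extract the "size" (dispersion) estimate
mers_k_mid <- offspring_fit$estimate[["size"]]

## approximate 95% confidence interval using the standard error estimate and
## the 0.025 and 0.975 quantiles of the normal distribution
mers_k_lower <- mers_k_mid + offspring_fit$sd[["size"]] * qnorm(0.025)
mers_k_upper <- mers_k_mid + offspring_fit$sd[["size"]] * qnorm(0.975)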

We can overlay the density of the fitted negative binomial distribution on the histogram of the number of secondary cases:
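A minimal sketch of that overlay computes the fitted density with dnbinom() and draws it over the histogram, plotted on the density scale:

R

# fitted negative binomial density over the observed range
fitted_density <- tibble(
  secondary_cases = 0:max(all_secondary$secondary_cases),
  density = dnbinom(
    x = secondary_cases,
    mu = offspring_fit$estimate[["mu"]],
    size = offspring_fit$estimate[["size"]]
  )
)

# histogram on the density scale, with the fitted curve on top
all_secondary %>%
  ggplot(aes(secondary_cases)) +
  geom_histogram(aes(y = after_stat(density)), binwidth = 1) +
  geom_line(
    data = fitted_density,
    mapping = aes(x = secondary_cases, y = density),
    colour = "red"
  ) +
  labs(x = "Number of secondary cases", y = "Density")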

Individual-level variation in transmission

The individual-level variation in transmission is defined by the relationship between the mean (\(R_{0}\)), dispersion (\(k\)), and the variance of a negative binomial distribution.

The negative binomial model has \(variance = R_{0}(1+\frac{R_{0}}{k})\), so smaller values of \(k\) indicate greater variance and, consequently, greater individual-level variation in transmission.

\[\uparrow variance = R_{0}(1+\frac{R_{0}}{\downarrow k})\]

When \(k\) approaches infinity (\(k \rightarrow \infty\)) the variance equals the mean (because \(\frac{R_{0}}{\infty}=0\)). This makes the conventional Poisson model a special case of the negative binomial model.
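A quick numerical illustration of this limit, plugging a few values of \(k\) into the variance formula above:

R

R0 <- 0.6
k_values <- c(0.02, 0.5, 5, 1000)

# variance = R0 * (1 + R0 / k): as k grows, the variance
# approaches the mean (the Poisson case)
R0 * (1 + R0 / k_values)
# approximately 18.60, 1.32, 0.67, 0.60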

Challenge

Use the distribution of secondary cases from the ebola_sim_clean object in the outbreaks package.

Fit a negative binomial distribution to estimate the mean and dispersion parameter of the offspring distribution.

Does the estimated dispersion parameter of Ebola provide evidence of an individual-level variation in transmission?

Review how we fitted a negative binomial distribution using the fitdistrplus::fitdist() function.

R

ebola_offspring <- ebola_secondary %>%
  pull(secondary_cases) %>%
  fitdist(distr = "nbinom")

ebola_offspring

OUTPUT

Fitting of the distribution ' nbinom ' by maximum likelihood 
Parameters:
     estimate  Std. Error
size 2.353899 0.250124606
mu   0.539300 0.009699219

R

## extract the "size" parameter
ebola_mid <- ebola_offspring$estimate[["size"]]

## calculate the 95% confidence intervals using the standard error estimate and
## the 0.025 and 0.975 quantiles of the normal distribution.

ebola_lower <- ebola_mid + ebola_offspring$sd[["size"]] * qnorm(0.025)

ebola_upper <- ebola_mid + ebola_offspring$sd[["size"]] * qnorm(0.975)

# ebola_mid
# ebola_lower
# ebola_upper

From the distribution of the number of secondary cases, we estimated a dispersion parameter \(k\) of 2.354, with a 95% confidence interval from 1.864 to 2.844.

Dispersion parameter estimates higher than one imply a low distribution variance and, hence, low individual-level variation in transmission.
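We can check this with the variance formula from the earlier callout:

R

# implied variance of the fitted Ebola offspring distribution
ebola_mu <- ebola_offspring$estimate[["mu"]]
ebola_k <- ebola_offspring$estimate[["size"]]

# variance = mu * (1 + mu / k): about 0.66, close to the mean of
# about 0.54, i.e. near-Poisson
ebola_mu * (1 + ebola_mu / ebola_k)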

But does this mean that the secondary case distribution does not have superspreading events (SSEs)? You will later find one additional challenge: How do you define an SSE threshold for Ebola?

We can use the maximum likelihood estimates from fitdistrplus to compare different models and assess fit performance using estimators like the AIC and BIC. Read further in the vignette on Estimate individual-level transmission and use the {superspreading} helper function ic_tbl() for this!

The dispersion parameter across diseases

Research into sexually transmitted and vector-borne diseases has previously suggested a ‘20/80’ rule, with 20% of individuals contributing at least 80% of the transmission potential (Woolhouse et al.).

On its own, the dispersion parameter \(k\) is hard to interpret intuitively, so converting it into a proportional summary enables easier comparison. When we consider a wider range of pathogens, we see there is no hard and fast rule for the percentage of cases that generates 80% of transmission, but variation does emerge as a common feature of infectious diseases:

  • When the 20% most infectious cases contribute 80% of transmission (or more), there is high individual-level variation in transmission, with a highly overdispersed offspring distribution (\(k<0.1\)), e.g. SARS-1.

  • When the 20% most infectious cases contribute ~50% of transmission, there is low individual-level variation in transmission, with a moderately dispersed offspring distribution (\(k > 0.1\)), e.g. Pneumonic Plague.

Evidence for variation in individual reproductive number. (Left, c) Proportion of transmission expected from the most infectious 20% of cases, for 10 outbreak or surveillance data sets (triangles). Dashed lines show proportions expected under the 20/80 rule (top) and homogeneity (bottom). (Right, d), Reported superspreading events (SSEs; diamonds) relative to estimated reproductive number R (squares) for twelve directly transmitted infections. Crosses show the 99th-percentile proposed as threshold for SSEs. (More figure details in Lloyd-Smith et al., 2005)

Controlling superspreading with contact tracing


During an outbreak, it is common to try to reduce transmission by identifying people who have come into contact with an infected person, then quarantining them in case they subsequently turn out to be infected. Such contact tracing can be deployed in multiple ways. ‘Forward’ contact tracing targets downstream contacts who may have been infected by a newly identified infection (i.e. the ‘index case’). ‘Backward’ tracing instead tracks the upstream primary case who infected the index case (or a setting or event at which the index case was infected), for example by retracing the history of contacts to the likely point of exposure. This makes it possible to identify others who were also potentially infected by this earlier primary case.

In the presence of individual-level variation in transmission, i.e., with an overdispersed offspring distribution, if this primary case is identified, a larger fraction of the transmission chain can be detected by forward tracing each of the contacts of this primary case (Endo et al., 2020).

Schematic representation of contact tracing strategies. Black arrows indicate the direction of transmission; blue and orange arrows indicate successful or failed contact tracing, respectively. When there is evidence of individual-level variation in transmission, often resulting in superspreading, backward contact tracing from the index case (blue circle) increases the probability of finding the primary case (green circle) or clusters with a larger fraction of cases, potentially increasing the number of quarantined cases (yellow circles). Claire Blackmore, 2021.

When there is evidence of individual-level variation (i.e. overdispersion), often resulting in so-called superspreading events, a large proportion of infections may be linked to a small proportion of original clusters. As a result, finding and targeting originating clusters in combination with reducing onwards infection may substantially enhance the effectiveness of tracing methods (Endo et al., 2020).

Empirical evidence evaluating the efficiency of backward tracing found that it identified 42% more cases than forward tracing, supporting its implementation when rigorous suppression of transmission is justified (Raymenants et al., 2022).

Probability of cases in a given cluster


Using {superspreading}, we can estimate the probability of having a cluster of secondary infections caused by a primary case identified by backward tracing of size \(X\) or larger (Endo et al., 2020).

R

# Set seed for random number generator
set.seed(33)

# estimate the probability of
# having a cluster size of 5, 10, or 25
# secondary cases from a primary case,
# given known reproduction number and
# dispersion parameter.
proportion_cluster_size(
  R = offspring_fit$estimate["mu"],
  k = offspring_fit$estimate["size"],
  cluster_size = c(5, 10, 25)
)

OUTPUT

          R          k prop_5 prop_10 prop_25
1 0.6045295 0.02039807  87.9%   74.6%   46.1%

Even though we have an \(R<1\), a highly overdispersed offspring distribution (\(k=0.02\)) means that if we detect a new case, there is a 46.1% probability that it originated from a cluster of 25 infections or more. Hence, by following a backward strategy, contact tracing efforts will increase the probability of successfully containing and quarantining this larger group of earlier-infected individuals, rather than simply focusing on the new case, who is likely to have infected nobody (because \(k\) is very small).

We can also use these estimates to inform limits on gatherings of a certain size, to reduce the epidemic by preventing potential superspreading events. Interventions can also target reducing the reproduction number in order to lower the probability of having clusters of secondary cases.
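For instance, here is a sketch of how a hypothetical intervention that halves the reproduction number, while keeping the dispersion estimate fixed, would change these cluster probabilities:

R

# hypothetical scenario: halve R, keep the dispersion estimate fixed
proportion_cluster_size(
  R = offspring_fit$estimate["mu"] / 2,
  k = offspring_fit$estimate["size"],
  cluster_size = c(5, 10, 25)
)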

Backward contact tracing for Ebola

Use the estimated Ebola parameters from the ebola_sim_clean object in the outbreaks package.

Calculate the probability of having a cluster of secondary infections caused by a primary case identified by backward tracing of size 5, 10, 15 or larger.

Would implementing a backward strategy at this stage of the Ebola outbreak increase the probability of containing and quarantining more onward cases?

Review how we estimated the probability of having clusters of a fixed size, given an offspring distribution mean and dispersion parameters, using the superspreading::proportion_cluster_size() function.

R

# estimate the probability of
# having a cluster size of 5, 10, or 25
# secondary cases from a primary case,
# given known reproduction number and
# dispersion parameter.
proportion_cluster_size(
  R = ebola_offspring$estimate["mu"],
  k = ebola_offspring$estimate["size"],
  cluster_size = c(5, 10, 25)
)

OUTPUT

       R        k prop_5 prop_10 prop_25
1 0.5393 2.353899   1.8%      0%      0%

The probability of having clusters of five people is 1.8%. At this stage, given these offspring distribution parameters, a backward strategy may not increase the probability of containing and quarantining more onward cases.

Challenges


Does Ebola have any superspreading event?

‘Superspreading events’ can mean different things to different people, so Lloyd-Smith et al., 2005 proposed a general protocol for defining a superspreading event (SSE). If the number of secondary infections caused by each case, \(Z\), follows a negative binomial distribution (\(R, k\)):

  • We define an SSE as any infected individual who infects more than \(Z(n)\) others, where \(Z(n)\) is the nth percentile of the \(Poisson(R)\) distribution.
  • A 99th-percentile SSE is then any case causing more infections than would occur in 99% of infectious histories in a homogeneous population.

Using the corresponding distribution function, estimate the SSE threshold for the Ebola offspring distribution estimated from the ebola_sim_clean object in the outbreaks package.

In a Poisson distribution, the lambda (or rate) parameter is equal to the estimated mean from the negative binomial distribution. You can explore this in The distribution zoo shiny app.

To get the quantile value for the 99th percentile, we need to use the quantile function for the Poisson distribution, qpois().

R

# get mean
ebola_mu_mid <- ebola_offspring$estimate["mu"]

# get 99th-percentile from poisson distribution
# with mean equal to mu
qpois(
  p = 0.99,
  lambda = ebola_mu_mid
)

OUTPUT

[1] 3

Compare this value with the ones reported by Lloyd-Smith et al., 2005. See figure below:

Reported superspreading events (SSEs; diamonds) relative to estimated reproduction number R (squares) for twelve directly transmitted infections. Lines show 5–95 percentile range of the number of secondary cases following a Poisson distribution with lambda equal to the reproduction number (\(Z \sim Poisson(R)\)), and crosses show the 99th-percentile proposed as threshold for SSEs. Stars represent SSEs caused by more than one source case. ‘Other’ diseases are: 1, Streptococcus group A; 2, Lassa fever; 3, Mycoplasma pneumonia; 4, pneumonic plague; 5, tuberculosis. R is not shown for ‘other’ diseases, and is off-scale for monkeypox.

Expected proportion of transmission

What is the proportion of cases responsible for 80% of transmission?

Use {superspreading} and compare the estimates for MERS, using the offspring distribution parameters from this tutorial episode, with the SARS-1 and Ebola offspring distribution parameters accessible via the {epiparameter} R package.

To use superspreading::proportion_transmission() we recommend reading the Estimate what proportion of cases cause a certain proportion of transmission reference manual.

Currently, {epiparameter} has offspring distributions for SARS, Smallpox, Mpox, Pneumonic Plague, Hantavirus Pulmonary Syndrome, and Ebola Virus Disease. Let’s access the offspring distribution mean and dispersion parameters for SARS-1:

R

# Load parameters
sars <- epiparameter::epidist_db(
  disease = "SARS",
  epi_dist = "offspring distribution",
  single_epidist = TRUE
)
sars_params <- epiparameter::get_parameters(sars)
sars_params

OUTPUT

      mean dispersion 
      1.63       0.16 

R

#' estimate for ebola --------------

ebola_epiparameter <- epiparameter::epidist_db(
  disease = "Ebola",
  epi_dist = "offspring distribution",
  single_epidist = TRUE
)
ebola_params <- epiparameter::get_parameters(ebola_epiparameter)
ebola_params

OUTPUT

      mean dispersion 
       1.5        5.1 

R

# estimate
# proportion of cases that
# generate 80% of transmission
proportion_transmission(
  R = ebola_params[["mean"]],
  k = ebola_params[["dispersion"]],
  percent_transmission = 0.8
)

OUTPUT

    R   k prop_80
1 1.5 5.1   43.2%

R

#' estimate for sars --------------

# estimate
# proportion of cases that
# generate 80% of transmission
proportion_transmission(
  R = sars_params[["mean"]],
  k = sars_params[["dispersion"]],
  percent_transmission = 0.8
)

OUTPUT

     R    k prop_80
1 1.63 0.16     13%

R

#' estimate for mers --------------

# estimate
# proportion of cases that
# generate 80% of transmission
proportion_transmission(
  R = offspring_fit$estimate["mu"],
  k = offspring_fit$estimate["size"],
  percent_transmission = 0.8
)

OUTPUT

          R          k prop_80
1 0.6045295 0.02039807    2.1%

MERS has the lowest percentage of cases (2.1%) responsible for 80% of the transmission, representative of highly overdispersed offspring distributions.

Ebola has the highest percentage of cases (43%) responsible for 80% of the transmission. This is representative of offspring distributions with high dispersion parameters.

inverse-dispersion?

The dispersion parameter \(k\) can be expressed differently across the literature.

  • On the Wikipedia page for the negative binomial distribution, this parameter is defined in its reciprocal form (refer to the variance equation).
  • In the distribution zoo shiny app, the dispersion parameter \(k\) is named “Inverse-dispersion”, but it is equal to the parameter estimated in this episode. We invite you to explore this!

heterogeneity?

The individual-level variation in transmission is also referred to as heterogeneity in transmission or the degree of heterogeneity in Lloyd-Smith et al., 2005, and as heterogeneous infectiousness in Campbell et al., 2018, which introduced the {outbreaker2} package. Similarly, a contact network can store heterogeneous epidemiological contacts, as in the documentation of the epicontacts package (Nagraj et al., 2018).

Read these blog posts

The Tracing monkeypox article from the JUNIPER consortium showcases the usefulness of network models on contact tracing.

The Going viral post from Adam Kucharski shares insights from YouTube virality, disease outbreaks, and marketing campaigns on the conditions that spark contagion online.

Key Points

  • Use epicontacts to calculate the number of secondary cases caused by a particular individual from linelist and contact data.
  • Use fitdistrplus to empirically estimate the offspring distribution from the number of secondary cases distribution.
  • Use {superspreading} to estimate the probability of having clusters of a given size from primary cases and inform contact tracing efforts.

Content from Simulate transmission chains


Last updated on 2024-05-04 | Edit this page

Estimated time: 32 minutes

Overview

Questions

  • How can we simulate transmission chains based on infection characteristics?

Objectives

  • Create a short term projection using a branching process with {epichains}.

Prerequisites

Learners should familiarise themselves with the following concept dependencies before working through this tutorial:

Statistics: Common probability distributions, including Poisson and negative binomial.

Epidemic theory: The reproduction number, \(R\).

Introduction


Individual variation in transmission can affect both the potential for an epidemic to establish in a population and the ease of control (Cori et al., 2017).

  • Greater variation reduces the overall probability of a single new case causing a large local outbreak, because most cases infect few others and individuals who generate a large number of secondary cases are relatively rare.

  • However, if a ‘superspreading event’ does occur and the outbreak gets established, this variation can make an outbreak harder to control using mass interventions (i.e. blanket interventions that implicitly assume everyone contributes equally to transmission), because some cases contribute disproportionately: a single uncontrolled case may generate a large number of secondary cases.

  • Conversely, variation in transmission may provide opportunities for targeted interventions if the individuals who contribute more to transmission (due to biological or behavioural factors), or the settings in which ‘superspreading events’ occur, share socio-demographic, environmental or geographical characteristics that can be defined.

How can we quantify the potential of a new infection to cause a large outbreak based on its reproduction number \(R\) and the dispersion \(k\) of its offspring distribution?

Observed number of cumulative cases from the Middle East respiratory syndrome (MERS) outbreak in South Korea, 2015, alongside simulated transmission chains assuming an offspring distribution with \(R=0.6\) and \(k=0.02\).

In this episode, we will use the {epichains} package to simulate transmission chains and estimate the potential for large outbreaks following the introduction of a new case. We are going to use it with functions from {epiparameter}, dplyr and purrr, so also loading the tidyverse package:

R

library(epichains)
library(epiparameter)
library(tidyverse)

The double-colon

The double-colon :: in R lets you call a specific function from a package without loading the entire package into the current environment.

For example, dplyr::filter(data, condition) uses filter() from the dplyr package.

This helps us remember package functions and avoid namespace conflicts.

Simulation of uncontrolled outbreaks


Infectious disease epidemics spread through populations when a chain of infected individuals transmit the infection to others. Branching processes can be used to model this transmission. A branching process is a stochastic process (i.e. a random process that can be described by a known probability distribution), where each infectious individual gives rise to a random number of individuals in the next generation of infection, starting with the index case in generation 1. The distribution of the number of secondary cases each individual generates is called the offspring distribution (Azam & Funk, 2024).

{epichains} provides methods to analyse and simulate the size and length of branching processes with a given offspring distribution. {epichains} implements a rapid and simple model to simulate transmission chains to assess epidemic risk, project cases into the future, and evaluate interventions that change \(R\).

chain size and length

  • The size of the transmission chain is defined as the total number of individuals infected across all generations of infection, and

  • the length of the transmission chain is the number of generations from the first case to the last case in the outbreak before the chain ended.

The size calculation includes the first case, and the length calculation includes the first generation, when the first case starts the chain (see figure below).

An example of a transmission chain starting with a single case C1. Cases are represented by blue circles and arrows indicating who infected whom. The chain grows through generations Gen 1, Gen 2, and Gen 3, producing cases C2, C3, C4, C5, and C6. The chain ends at generation Gen 3 with cases C4, C5, and C6. The size of C1’s chain is 6, including C1 (that is, the sum of all blue circles), and the length is 3, which includes Gen 1 (maximum number of generations reached by C1’s chain) (Azam & Funk, 2024).

To use {epichains}, we need to know (or assume) two key epidemiological values: the offspring distribution and the generation time.

Get the offspring distribution


Here we assume the MERS offspring distribution follows a negative binomial distribution, with mean (reproduction number \(R\)) and dispersion \(k\) values estimated from the linelist and contact data of mers_korea_2015 in the outbreaks R package in the previous episode.

R

mers_offspring <- c(mean = 0.60, dispersion = 0.02)

offspring distribution for epichains

We input an offspring distribution to {epichains} by referring to the R function that generates random values from the distribution we want. For a negative binomial distribution, we use rnbinom with its corresponding mu and size arguments:

R

  offspring_dist = rnbinom,
  mu = mers_offspring["mean"],
  size = mers_offspring["dispersion"],

The reference manual in ?rnbinom tells us the specific arguments required.

{epichains} can accept any R function that generates random numbers, so the specified arguments will change depending on the R function used. For more details on the range of possible options, see the function reference manual.

For example, let’s say we want to use a Poisson distribution for the offspring distribution. First, read about the required arguments in the ?rpois reference manual. Second, specify the lambda argument, also known as the rate or mean in the literature. In {epichains}, this can look like this:

R

  offspring_dist = rpois,
  lambda = mers_offspring["mean"],

In this example, we can specify lambda = mers_offspring["mean"] because the mean number of secondary cases generated (i.e. \(R\)) should be the same regardless of the distribution we assume. What changes is the variance of the distribution, and hence the level of individual-level variation in transmission. When the dispersion parameter \(k\) approaches infinity (\(k \rightarrow \infty\)) in a negative binomial distribution, the variance equals the mean. This makes the conventional Poisson distribution a special case of the negative binomial distribution.
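We can check this numerically by drawing from both candidate offspring distributions with the same mean; the seed and number of draws below are arbitrary choices:

R

set.seed(33)

# draws from the negative binomial and Poisson offspring distributions
nbinom_draws <- rnbinom(
  n = 1e4,
  mu = mers_offspring["mean"],
  size = mers_offspring["dispersion"]
)
pois_draws <- rpois(n = 1e4, lambda = mers_offspring["mean"])

# similar means, but very different variances
c(mean(nbinom_draws), var(nbinom_draws))
c(mean(pois_draws), var(pois_draws))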

Get generation time


The serial interval distribution is often used to approximate the generation time distribution. This approximation is commonly used because it is easier to observe and measure the onset of symptoms in each case than the precise time of infection.

A schematic of the relationship of different time periods of transmission between an infector and an infectee in a transmission pair. Exposure window is defined as the time interval having viral exposure, and transmission window is defined as the time interval for onward transmission with respect to the infection time (Chung Lau et al., 2021).

However, using the serial interval as an approximation of the generation time is primarily valid for diseases in which infectiousness starts after symptom onset (Chung Lau et al., 2021). In cases where infectiousness starts before symptom onset, the serial intervals can have negative values, which is the case for diseases with pre-symptomatic transmission (Nishiura et al., 2020).

Let’s use the {epiparameter} package to access and use the available serial interval for MERS disease:

R

serial_interval <- epidist_db(
  disease = "mers",
  epi_dist = "serial",
  single_epidist = TRUE
)

plot(serial_interval, day_range = 0:25)

The serial interval for MERS has a mean of 12.6 days and a standard deviation of 2.8 days.

generation time for epichains

In {epichains}, we need to specify the generation time as a function that generates random numbers. Using {epiparameter} has the advantage of providing the epiparameter::generate() function for this input. It looks like this:

R

function(x) generate(x = serial_interval, times = x)

This interface is similar to the one cfr uses to link with {epiparameter}. Read the work with delay distributions vignette for further context.
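For example, we can draw five random serial-interval values from this distribution:

R

# five random draws from the MERS serial interval distribution
epiparameter::generate(x = serial_interval, times = 5)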

Simulate a single chain


Now we are prepared to use the simulate_chains() function from {epichains} to create one transmission chain:

R

simulate_chains(
  # simulation controls
  index_cases = 5,
  statistic = "size",
  # offspring
  offspring_dist = rnbinom,
  mu = mers_offspring["mean"],
  size = mers_offspring["dispersion"],
  # generation
  generation_time = function(x) generate(x = serial_interval, times = x)
)

simulate_chains() requires three sets of arguments as a minimum:

  • simulation controls,
  • offspring distribution, and
  • generation time.

In the lines above, we described how to specify the offspring distribution and generation time. The simulation controls include at least two arguments:

  • index_cases, which defines the number of index cases to simulate transmission chains for, and
  • statistic, which defines a chain statistic to track (either "size" or "length") as the stopping criteria for each chain being simulated.

Stopping criteria

This is a customisable feature of {epichains}. By default, branching process simulations end when they have gone extinct. For long-lasting transmission chains, you can add the stat_max argument to simulate_chains().

For example, if we set a stopping criterion of stat_max = 500 for statistic = "size", no more offspring will be produced after a chain reaches a size of 500.
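A sketch of that call, reusing the MERS offspring and serial interval inputs defined earlier in this episode:

R

simulate_chains(
  # simulation controls
  index_cases = 5,
  statistic = "size",
  stat_max = 500, # stop producing offspring once a chain reaches size 500
  # offspring
  offspring_dist = rnbinom,
  mu = mers_offspring["mean"],
  size = mers_offspring["dispersion"],
  # generation
  generation_time = function(x) generate(x = serial_interval, times = x)
)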

The simulate_chains() output is an <epichains> class object, which we can analyse further in R.

Simulate multiple chains


We can use simulate_chains() to create multiple chains, increasing our chances of observing rare uncontrolled outbreak projections given an overdispersed offspring distribution.

We need to specify three additional elements:

  • set.seed(<integer>), which is a random number generator function with a specified seed value, the <integer> number, to ensure consistent results across different runs of the code.
  • number_chains, which defines the number of simulations to run.
  • initial_cases, which defines the number of initial cases passed to the index_cases argument explained above.

R

# Set seed for random number generator
set.seed(33)
# Number of simulation runs
number_chains <- 1000
# Number of initial cases
initial_cases <- 1

number_chains and initial_cases are conveniently stored in objects to facilitate downstream reuse in the workflow.

Iteration using purrr

Iteration aims to perform the same action on different objects repeatedly.

Learn how to use the core purrr functions like map() from the YouTube tutorial on How to purrr by Equitable Equations.

Or, if you previously used the *apply family of functions, visit the package vignette on purrr base R, which shares key differences, direct translations, and examples.

To get multiple chains, we must apply the simulate_chains() function to each chain defined by a sequence of numbers from 1 to 1000.

purrr and epichains

First, let’s sketch how we use purrr::map() with epichains::simulate_chains(). The map() function requires two arguments:

  • .x, with a vector of numbers, and
  • .f, a function to apply to each value of the vector.

R

map(
  # vector of numbers (chain IDs)
  .x = seq_len(number_chains),
  # function to iterate to each chain ID number
  .f = function(sim) {
    simulate_chains(...) %>%
      # creates a column with the chain ID number
      mutate(chain_id = sim)
  }
) %>%
  # combine list outputs (for each chain ID) into a single data frame
  list_rbind()

The sim element registers the iteration number (chain ID) as a new column in the <epichains> output. The purrr::list_rbind() function combines all the list outputs from map() into a single data frame.

Why a dot (.) as a prefix? The tidy design principles book has a chapter on the dot prefix!

Now we are ready to use map() to run simulate_chains() repeatedly for each chain ID from 1 to 1000:

R

simulated_chains_map <-
  # iterate one function across multiple numbers (chain IDs)
  map(
    # vector of numbers (chain IDs)
    .x = seq_len(number_chains),
    # function to iterate to each chain ID number
    .f = function(sim) {
      simulate_chains(
        # simulation controls
        index_cases = initial_cases,
        statistic = "size",
        # offspring
        offspring_dist = rnbinom,
        mu = mers_offspring["mean"],
        size = mers_offspring["dispersion"],
        # generation
        generation_time = function(x) generate(x = serial_interval, times = x)
      ) %>%
        # creates a column with the chain ID number
        mutate(chain_id = sim)
    }
  ) %>%
  # combine list outputs (for each chain ID) into a single data frame
  list_rbind()

Read the epichains output

To explore the output format of the <epichains> class object named simulated_chains_map, let’s look at the simulated chain_id number 806.

Let’s use dplyr::filter() for this:

R

chain_to_observe <- 806

R

#### get epichain summary ----------------------------------------------------

simulated_chains_map %>%
  filter(chain_id == chain_to_observe)

OUTPUT

`<epichains>` object

< tree head (from first known infector_id) >

  infectee_id sim_id infector_id generation     time chain_id
2           1      2           1          2 16.38623      806
3           1      3           1          2 11.79430      806
4           1      4           1          2 10.77252      806
5           1      5           1          2 11.39945      806
6           1      6           1          2 10.23130      806
7           1      7           2          3 26.01046      806


Number of infectors (known): 3
Number of generations: 3
Use `as.data.frame(<object_name>)` to view the full output in the console.

Key elements from this output are in the footer, the piece of text that appears at the bottom:

OUTPUT

Number of infectors (known): 3
Number of generations: 3

The simulated chain_id number 806 has three known infectors and three generations. These numbers are more visible when reading the <epichains> objects as a data frame.

R

#### infector-infectee data frame --------------------------------------------

simulated_chains_map %>%
  filter(chain_id == chain_to_observe) %>%
  as_tibble()

OUTPUT

# A tibble: 9 × 6
  infectee_id sim_id infector_id generation  time chain_id
        <int>  <dbl>       <dbl>      <int> <dbl>    <int>
1           1      1          NA          1   0        806
2           1      2           1          2  16.4      806
3           1      3           1          2  11.8      806
4           1      4           1          2  10.8      806
5           1      5           1          2  11.4      806
6           1      6           1          2  10.2      806
7           1      7           2          3  26.0      806
8           1      8           2          3  29.8      806
9           1      9           2          3  26.6      806

Chain 806 tells us a story: “In the first transmission generation at time = 0, one index case infected the first case with sim_id = 1. Then, in the second transmission generation (between time 10 to 16), sim_id = 1 infected five cases. Later, in the third transmission generation (between time 26 to 30), sim_id = 2 infected three new cases.”

The output data frame collects infectees as the observation unit:

  • Each infectee has a sim_id.
  • Each infectee that behaved as an infector is registered in the infector_id column using sim_id of that infectee.
  • Each infectee got infected in a specific generation and (continuous) time.
  • The simulation number is registered under the chain_id column.

Note: The Number of infectors (known) includes the NA observation under the infector_id column. This refers to the infector specified as index case (in the index_cases argument), which started the transmission chain to the infectee of sim_id = 1, at generation = 1, and time = 0.

Visualize multiple chains


To visualize the simulated chains, we need some pre-processing:

  1. Use dplyr to round the infection times to whole numbers, resembling surveillance days.
  2. Count the daily cases in each simulation (by chain_id).
  3. Calculate the cumulative number of cases within a simulation.

R

# daily aggregate of cases
simulated_chains_day <- simulated_chains_map %>%
  # use data.frame output from <epichains> object
  as_tibble() %>%
  # transform chain ID column to factor (categorical variable)
  mutate(chain_id = as_factor(chain_id)) %>%
  # get the round number (day) of infection times
  mutate(day = ceiling(time)) %>%
  # count the daily number of cases in each simulation (chain ID)
  count(chain_id, day, name = "cases") %>%
  # calculate the cumulative number of cases for each simulation (chain ID)
  group_by(chain_id) %>%
  mutate(cases_cumsum = cumsum(cases)) %>%
  ungroup()

Before plotting, let’s create a summary table with the total duration and size of each chain. We can use the dplyr “combo” of group_by(), summarise() and ungroup():

R

# Summarise the chain duration and size
sim_chains_max <-
  simulated_chains_day %>%
  group_by(chain_id) %>%
  summarise(
    # duration
    day_max = max(day),
    # size
    cases_total = max(cases_cumsum)
  ) %>%
  ungroup()

sim_chains_max

OUTPUT

# A tibble: 1,000 × 3
   chain_id day_max cases_total
   <fct>      <dbl>       <int>
 1 1              0           1
 2 2              0           1
 3 3              0           1
 4 4              0           1
 5 5              0           1
 6 6              0           1
 7 7              0           1
 8 8              0           1
 9 9              0           1
10 10             0           1
# ℹ 990 more rows

Now we are ready to use the ggplot2 package:

R

# Visualize transmission chains by cumulative cases
ggplot() +
  # create grouped chain trajectories
  geom_line(
    data = simulated_chains_day,
    mapping = aes(
      x = day,
      y = cases_cumsum,
      group = chain_id
    ),
    color = "black",
    alpha = 0.25,
    show.legend = FALSE
  ) +
  # create points to visualize the chain endpoint
  geom_point(
    data = sim_chains_max,
    mapping = aes(
      x = day_max,
      y = cases_total,
      group = chain_id,
      color = chain_id
    ),
    show.legend = FALSE
  ) +
  # define a 100-case threshold
  geom_hline(aes(yintercept = 100), lty = 2) +
  labs(
    x = "Day",
    y = "Cumulative cases"
  )

Most introductions of one index case do not generate secondary cases (N = 940), and most outbreaks that do take off rapidly become extinct (median duration of 17.5 days and median size of 9 cases); only 2 of the 1000 simulated epidemic trajectories (0.2%) reach more than 100 infected cases. This finding is particularly remarkable because the reproduction number \(R\) is less than 1 (offspring distribution mean of 0.6); given an offspring distribution dispersion parameter of 0.02, it shows the potential for explosive outbreaks of MERS disease.

We can count how many chains reached the 100-case threshold using dplyr functions:

R

# number of chains that reached the 100-case threshold
sim_chains_max %>%
  arrange(desc(day_max)) %>%
  filter(cases_total > 100)

OUTPUT

# A tibble: 2 × 3
  chain_id day_max cases_total
  <fct>      <dbl>       <int>
1 666           88         110
2 23            45         113

Let’s overlap the cumulative number of observed cases using the linelist object from the mers_korea_2015 dataset of the outbreaks R package. To prepare the dataset so we can plot daily total cases over time, we use incidence2 to convert the linelist to an <incidence2> object, complete the missing dates of the time series with complete_dates(), and then compute the cumulative number of cases with dplyr:

R

library(outbreaks)

mers_cumcases <- mers_korea_2015$linelist %>%
  # incidence2 workflow
  incidence2::incidence(date_index = "dt_onset") %>%
  incidence2::complete_dates() %>%
  # wrangling using {dplyr}
  mutate(count_cumsum = cumsum(count)) %>%
  rownames_to_column(var = "day") %>%
  mutate(day = as.numeric(day))

Use plot() to make an incidence plot:

R

# plot the incidence2 object
plot(mers_cumcases)

When plotting the observed number of cumulative cases from the Middle East respiratory syndrome (MERS) outbreak in South Korea in 2015 alongside the previously simulated chains, we see that the observed cases followed a trajectory that is consistent with the simulated explosive outbreak dynamics (which makes sense, given the simulation uses parameters based on this specific outbreak).
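A sketch of that comparison layers the observed cumulative counts over the simulated trajectories created earlier:

R

ggplot() +
  # simulated chain trajectories
  geom_line(
    data = simulated_chains_day,
    mapping = aes(x = day, y = cases_cumsum, group = chain_id),
    color = "black",
    alpha = 0.25
  ) +
  # observed cumulative cases from the MERS linelist
  geom_line(
    data = mers_cumcases,
    mapping = aes(x = day, y = count_cumsum),
    color = "red",
    linewidth = 1
  ) +
  labs(x = "Day", y = "Cumulative cases")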

When we increase the dispersion parameter from \(k = 0.01\) to \(k = \infty\) - and hence reduce individual-level variation in transmission - and assume a fixed reproduction number \(R = 1.5\), the proportion of simulated outbreaks that reach the 100-case threshold increases. This is because the simulated outbreaks now have a more consistent, clockwork dynamic, rather than the high level of variability seen previously.

Growth of simulated outbreaks with R = 1.5 and one initial case, conditional on non-extinction. Boxes show the median and interquartile range (IQR) of the first disease generation with 100 cases; whiskers show the most extreme values within 1.5 × IQR of the boxes, and crosses show outliers. Percentages show the proportion of 10,000 simulated outbreaks that reached the 100-case threshold (Lloyd-Smith et al., 2005).

Early spread projections

In the epidemic’s initial phase, you can use {epichains} to apply a branching process model to project the number of future cases. Even though the model accounts for randomness in transmission and variation in the number of secondary cases, there may be additional local features we have not considered. Analysis of early forecasts made for COVID in different countries using this model structure found that predictions were often overconfident (Pearson et al., 2020). This is likely because the real-time model did not include all the changes in the offspring distribution that were happening at the local level as a result of behaviour change and control measures. You can read more about the importance of local context in COVID-19 models in Eggo et al. (2020).

We invite you to read the vignette on Projecting infectious disease incidence: a COVID-19 example! for more on making predictions using {epichains}.

Challenges


Monkeypox large outbreak potential

Evaluate the potential for a new Monkeypox (Mpox) case to generate an explosive, large outbreak.

  • Simulate 1000 transmission chains with 1 initial case each.
  • Use the appropriate package to access delay data from previous outbreaks.
  • How many simulated trajectories reach more than 100 infected cases?

With {epiparameter}, you can access and use offspring and delay distributions from previous outbreaks, including Mpox.

R

library(epiparameter)
library(tidyverse)

epidist_db(epi_dist = "offspring") %>%
  list_distributions() %>%
  count(disease, epi_distribution)

OUTPUT

                        disease       epi_distribution n
1           Ebola Virus Disease offspring distribution 1
2 Hantavirus Pulmonary Syndrome offspring distribution 1
3                          Mpox offspring distribution 1
4              Pneumonic Plague offspring distribution 1
5                          SARS offspring distribution 2
6                      Smallpox offspring distribution 4

R

epidist_db(epi_dist = "serial interval") %>%
  list_distributions() %>%
  count(disease, epi_distribution)

OUTPUT

                disease epi_distribution n
1              COVID-19  serial interval 4
2   Ebola Virus Disease  serial interval 4
3             Influenza  serial interval 1
4                  MERS  serial interval 2
5 Marburg Virus Disease  serial interval 2
6                  Mpox  serial interval 5

R

# load packages -----------------------------------------------------------

library(epiparameter)
library(epichains)
library(tidyverse)

# delays ------------------------------------------------------------------

mpox_offspring_epidist <- epidist_db(
  disease = "mpox",
  epi_dist = "offspring",
  single_epidist = TRUE
)

mpox_offspring <- get_parameters(mpox_offspring_epidist)

mpox_serialint <- epidist_db(
  disease = "mpox",
  epi_dist = "serial interval",
  single_epidist = TRUE
)

# iterate -----------------------------------------------------------------

# Set seed for random number generator
set.seed(33)
# Number of simulation runs
number_chains <- 1000
# Number of initial cases
initial_cases <- 1

simulated_chains_mpox <-
  # iterate one function across multiple numbers (chain IDs)
  map(
    # vector of numbers (chain IDs)
    .x = seq_len(number_chains),
    # function to iterate to each chain ID number
    .f = function(sim) {
      simulate_chains(
        # simulation controls
        index_cases = initial_cases,
        statistic = "size",
        # offspring
        offspring_dist = rnbinom,
        mu = mpox_offspring["mean"],
        size = mpox_offspring["dispersion"],
        # generation
        generation_time = function(x) generate(x = mpox_serialint, times = x)
      ) %>%
        # creates a column with the chain ID number
        mutate(chain_id = sim)
    }
  ) %>%
  # combine list outputs (for each chain ID) into a single data frame
  list_rbind()

# visualize ---------------------------------------------------------------

# daily aggregate of cases
simulated_chains_mpox_day <- simulated_chains_mpox %>%
  # use data.frame output from <epichains> object
  as_tibble() %>%
  # transform chain ID column to factor (categorical variable)
  mutate(chain_id = as_factor(chain_id)) %>%
  # get the round number (day) of infection times
  mutate(day = ceiling(time)) %>%
  # count the daily number of cases in each simulation (chain ID)
  count(chain_id, day, name = "cases") %>%
  # calculate the cumulative number of cases for each simulation (chain ID)
  group_by(chain_id) %>%
  mutate(cases_cumsum = cumsum(cases)) %>%
  ungroup()

# Visualize transmission chains by cumulative cases
ggplot() +
  # create grouped chain trajectories
  geom_line(
    data = simulated_chains_mpox_day,
    mapping = aes(
      x = day,
      y = cases_cumsum,
      group = chain_id
    ),
    color = "black",
    alpha = 0.25,
    show.legend = FALSE
  ) +
  # define a 100-case threshold
  geom_hline(aes(yintercept = 100), lty = 2) +
  labs(
    x = "Day",
    y = "Cumulative cases"
  )
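
To answer the challenge question numerically, here is a minimal sketch that counts how many of the 1000 simulated trajectories ever cross the 100-case threshold, reusing the simulated_chains_mpox_day data frame created above:

R

# count how many simulated chains ever exceed the 100-case threshold
simulated_chains_mpox_day %>%
  group_by(chain_id) %>%
  summarise(peak_cases = max(cases_cumsum)) %>%
  summarise(n_over_100 = sum(peak_cases > 100))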

Assuming an Mpox outbreak with \(R\) = 0.32 and \(k\) = 0.58, no trajectory among the 1000 simulations reaches more than 100 infected cases, in contrast to MERS (\(R\) = 0.6 and \(k\) = 0.02).

With {superspreading}, you can get numerical solutions to quantities that {epichains} approximates by simulating branching processes. We invite you to read the {superspreading} vignette on Epidemic risk and respond to:

  • What is the probability that a newly introduced pathogen will cause a large outbreak?
  • What is the probability that an infection will, by chance, fail to establish following initial introduction(s)?
  • What is the probability the outbreak will be contained?

Check how these estimates vary non-linearly with respect to the mean reproduction number \(R\) and dispersion \(k\) of a given disease.
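
As a pointer, here is a minimal sketch of the kind of closed-form calculation {superspreading} provides; it assumes the probability_epidemic() function with arguments R, k, and num_init_infect, as used in the Epidemic risk vignette:

R

library(superspreading)

# probability that a single introduction leads to a large outbreak,
# for two levels of individual-level variation at the same mean R
probability_epidemic(R = 1.5, k = 0.1, num_init_infect = 1)
probability_epidemic(R = 1.5, k = 1, num_init_infect = 1)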

From a distribution of secondary cases

Christian Althaus (2015) reused data published by Faye et al. (2015) (Figure 2) on the transmission tree of Ebola virus disease in Conakry, Guinea, 2014.

Using the data under the hint tab, estimate the offspring distribution from the distribution of secondary cases. Then estimate the large outbreak potential from this data.

Code with the transmission tree data written by Christian Althaus, 2015:

R

# Number of individuals in the trees
n <- 152
# Number of secondary cases for all individuals
c1 <- c(1, 2, 2, 5, 14, 1, 4, 4, 1, 3, 3, 8, 2, 1, 1,
        4, 9, 9, 1, 1, 17, 2, 1, 1, 1, 4, 3, 3, 4, 2,
        5, 1, 2, 2, 1, 9, 1, 3, 1, 2, 1, 1, 2)
c0 <- c(c1, rep(0, n - length(c1)))

c0 %>%
  enframe() %>%
  ggplot(aes(value)) +
  geom_histogram()
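
Before fitting, a quick sanity check (a minimal sketch) confirms the overdispersion that motivates a negative binomial rather than a Poisson fit: the sample variance of the secondary case counts is far larger than their mean:

R

# overdispersion check: the variance of the secondary case counts
# greatly exceeds their mean
mean(c0)
var(c0)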

R

# load packages ---------------------------
library(fitdistrplus)
library(epiparameter)
library(epichains)
library(tidyverse)

# fit a negative binomial distribution ------------------------------------

# Fitting a negative binomial distribution to the number of secondary cases
fit.cases <- fitdist(c0, "nbinom")
fit.cases

OUTPUT

Fitting of the distribution ' nbinom ' by maximum likelihood 
Parameters:
      estimate Std. Error
size 0.1814260 0.03990278
mu   0.9537995 0.19812301

R

# serial interval parameters ----------------------------------------------

ebola_serialinter <- epidist_db(
  disease = "ebola",
  epi_dist = "serial interval",
  single_epidist = TRUE
)

# simulate outbreak trajectories ------------------------------------------

# Set seed for random number generator
set.seed(645)
# Number of simulation runs
number_chains <- 1e2
# Number of initial cases
initial_cases <- 1

sim_multiple_chains <-
  map(
    .x = seq_len(number_chains),
    .f = function(sim) {
      simulate_chains(
        index_cases = initial_cases,
        # stopping
        statistic = "size",
        # offspring
        offspring_dist = rnbinom,
        mu = fit.cases$estimate["mu"],
        size = fit.cases$estimate["size"],
        # generation
        generation_time = function(x) generate(x = ebola_serialinter, times = x)
      ) %>%
        mutate(simulation_n = sim)
    }
  ) %>%
  # combine list outputs (for each chain ID) into a single data frame
  list_rbind()

# visualize ----------------------------------------

sim_chains_aggregate <-
  sim_multiple_chains %>%
  as_tibble() %>%
  mutate(simulation_n = as_factor(simulation_n)) %>%
  mutate(day = ceiling(time)) %>%
  count(simulation_n, day, name = "cases") %>%
  group_by(simulation_n) %>%
  mutate(cases_cumsum = cumsum(cases)) %>%
  ungroup()

ggplot() +
  geom_line(
    data = sim_chains_aggregate,
    mapping = aes(
      x = day,
      y = cases_cumsum,
      group = simulation_n
    ),
    show.legend = FALSE
  ) +
  # define a 100-case threshold
  geom_hline(aes(yintercept = 100), lty = 2)
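
To quantify this, a minimal sketch that computes the proportion of the 100 simulated trajectories crossing the 100-case threshold, reusing sim_chains_aggregate from above:

R

# proportion of simulated chains that exceed 100 cumulative cases
sim_chains_aggregate %>%
  group_by(simulation_n) %>%
  summarise(peak_cases = max(cases_cumsum)) %>%
  summarise(prop_over_100 = mean(peak_cases > 100))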

Remarkably, even with a reproduction number below 1 (\(R\) = 0.95), outbreaks can be potentially explosive. The observed variation in individual infectiousness for Ebola means that, although the probability of extinction is high, new index cases still have the potential to cause explosive regrowth of the epidemic.

Key Points

  • Use {epichains} to simulate the large outbreak potential of diseases with overdispersed offspring distributions.