Last updated on 2025-05-06 | Edit this page
Week 3: Estimate superspreading and simulate transmission chains
This practical is based in the following tutorial episodes:
- https://epiverse-trace.github.io/tutorials-middle/superspreading-estimate.html
- https://epiverse-trace.github.io/tutorials-middle/superspreading-simulate.html
Welcome!
- A reminder of our Code of Conduct. If you experience or witness unacceptable behaviour, or have any other concerns, please notify the course organisers or host of the event. To report an issue involving one of the organisers, please use the LSHTM’s Report and Support tool.
Read This First
Instructions:
- Each
Activity
has five sections: the Goal, Questions, Inputs, Your Code, and Your Answers. - Solve each Activity in the corresponding
.R
file mentioned in theYour Code
section. - Paste your figure and table outputs and write your answer to the
questions in the section
Your Answers
. - Choose one group member to share your group’s results with the rest of the participants.
During the practical, instead of simply copying and pasting, we encourage learners to increase their fluency writing R by using:
- The double-colon notation, e.g.
package::function()
to specify which package a function comes from, avoid namespace conflicts, and find functions using keywords. - Tab key ↹ to autocomplete package or function names and display possible arguments.
-
Execute
one line of code or multiple lines connected by the pipe operator
(
%>%
) by placing the cursor in the code of interest and pressing theCtrl
+Enter
. -
R
shortcuts to insert the pipe operator (
%>%
) usingCtrl/Cmd
+Shift
+M
, or insert the assignment operator (<-
) usingAlt/Option
+-
.
If your local configuration was not possible to setup:
- Create one copy of the Posit Cloud RStudio project.
Practical
This practical has two activities.
Activity 1: Account for superspreading
Estimate extent of individual-level variation (i.e. the dispersion parameter) of the offspring distribution, which refers to the variability in the number of secondary cases per individual, and the proportion of transmission that is linked to ‘superspreading events’ using the following available inputs:
- Line list of cases
- Contact tracing data
As a group, Write your answers to these questions:
- From descriptive and estimation steps:
- What set has more infections related to fewer clusters in the contact network?
- What set has the most skewed histogram of secondary cases?
- Does the estimated dispersion parameter correlate with the contact network and histogram of secondary cases?
- On decision making:
- What is the proportion of new cases originating from a cluster of at least 10 cases?
- Would you recommend a backward tracing strategy?
- Interpret: How would you communicate these results to a decision-maker?
- Compare: What differences do you identify from other group outputs? (if available)
Inputs
Solution
Code
Set 1 (sample)
R
# Load packages -----------------------------------------------------------
library(epicontacts)
library(fitdistrplus)
library(tidyverse)
# Read linelist and contacts ----------------------------------------------
dat_contacts <- readr::read_rds(
"https://epiverse-trace.github.io/tutorials-middle/data/set-01-contacts.rds" #<DIFFERENT PER GROUP>
)
dat_linelist <- readr::read_rds(
"https://epiverse-trace.github.io/tutorials-middle/data/set-01-linelist.rds" #<DIFFERENT PER GROUP>
)
# Create an epicontacts object -------------------------------------------
epi_contacts <-
epicontacts::make_epicontacts(
linelist = dat_linelist,
contacts = dat_contacts,
directed = TRUE
)
epi_contacts
# visualize the contact network
contact_network <- epicontacts::vis_epicontacts(epi_contacts)
contact_network
# Count secondary cases per subject in contacts and linelist --------
secondary_cases <- epicontacts::get_degree(
x = epi_contacts,
type = "out",
only_linelist = TRUE
)
# plot the histogram of secondary cases
individual_reproduction_num <- secondary_cases %>%
enframe() %>%
ggplot(aes(value)) +
geom_histogram(binwidth = 1) +
labs(
x = "Number of secondary cases",
y = "Frequency"
)
individual_reproduction_num
# Fit a negative binomial distribution -----------------------------------
offspring_fit <- secondary_cases %>%
fitdistrplus::fitdist(distr = "nbinom")
offspring_fit
# Estimate proportion of new cases from a cluster of secondary cases -----
# Set seed for random number generator
set.seed(33)
# Estimate the proportion of new cases originating from
# a transmission cluster of at least 5, 10, or 25 cases
proportion_cases_by_cluster_size <-
superspreading::proportion_cluster_size(
R = offspring_fit$estimate["mu"],
k = offspring_fit$estimate["size"],
cluster_size = c(5, 10, 25)
)
proportion_cases_by_cluster_size
Outputs
Group 1
contact network | histogram of secondary cases |
---|---|
![]() |
![]() |
Group 2
contact network | histogram of secondary cases |
---|---|
![]() |
![]() |
Group 3
contact network | histogram of secondary cases |
---|---|
![]() |
![]() |
Group 1/2/3
R
#> R k prop_5 prop_10 prop_25
#> 1 0.8 0.01 95.1% 89.8% 75.1%
#> 2 0.8 0.10 66.7% 38.7% 7.6%
#> 3 0.8 0.50 25.1% 2.8% 0%
Interpretation
Interpretation template:
- For R = 0.8 and k = 0.01:
- The proportion of new cases originating from a cluster of at least 5 secondary cases from a primary case is 95%
- The proportion of all transmission events that were part of secondary case clusters (i.e., from the same primary case) of at least 5 cases is 95%
Interpretation Helpers:
- From the contact network, set 1 has the highest frequency of infections related with a small proportion of clusters.
- From the histogram of secondary cases, skewness in set 1 is higher than set 2 and set 3.
- Set 1 has cases with the highest number of secondary cases (n = 50), compared with set 2 (n = ~25) and set 3 (n = 11).
- The contact networks and histograms of secondary cases correlate with the estimated dispersion parameters: A small proportion of clusters generating most of new cases produces a more skewed histogram, and a lowest estimate of dispersion parameter.
- About probability of new cases from transmission cluster of size at
least 10 cases, and the recommending backward tracing strategy:
- set 1: 89%, yes.
- set 2: 38%, probably no?
- set 3: 3%, no.
Activity 2: Simulate transmission chains
Estimate the potential for large outbreaks that could occur based on 1000 simulated outbreaks using the following available inputs:
- Basic reproduction number
- Dispersion parameter
As a group, Write your answers to these questions:
- You have been assigned to explore
Chain ID
. From the output data frame, describe:- How many generations there are.
- Who infected whom, and when (with reference to the day of infection).
- Among simulated outbreaks:
- How many chains reached a 100 case threshold?
- What is the maximum size of chain? (The cumulative number of case)
- What is the maximum length of chain? (The number of days until the chain stops)
- Interpret: How would you communicate these results to a decision-maker?
- Compare: What differences do you identify from other group outputs? (if available)
Inputs
Group | Parameters | Chain ID |
---|---|---|
1 | R = 0.8, k = 0.01 | 957 |
2 | R = 0.8, k = 0.1 | 281 |
3 | R = 0.8, k = 0.5 | 38 |
4 | R = 1.5, k = 0.01 | 261 |
5 | R = 1.5, k = 0.1 | 325 |
6 | R = 1.5, k = 0.5 | 591 |
Solution
Code
Set 1 (sample)
R
# Load packages -----------------------------------------------------------
library(epiparameter)
library(epichains)
library(tidyverse)
# Set input parameters ---------------------------------------------------
known_basic_reproduction_number <- 0.8 #<DIFFERENT PER GROUP>
known_dispersion <- 0.01 #<DIFFERENT PER GROUP>
chain_to_observe <- 957 #<DIFFERENT PER GROUP>
# Set iteration parameters -----------------------------------------------
# Create generation time as <epiparameter> object
generation_time <- epiparameter::epiparameter(
disease = "disease x",
epi_name = "generation time",
prob_distribution = "gamma",
summary_stats = list(mean = 3, sd = 1)
)
# Simulate multiple chains -----------------------------------------------
# run set.seed() and epichains::simulate_chains() together, in the same run
# Set seed for random number generator
set.seed(33)
multiple_chains <- epichains::simulate_chains(
# simulation controls
n_chains = 1000, # number of chains to simulate
statistic = "size",
stat_threshold = 500, # stopping criteria
# offspring
offspring_dist = rnbinom,
mu = known_basic_reproduction_number,
size = known_dispersion,
# generation
generation_time = function(x) generate(x = generation_time, times = x)
)
multiple_chains
# Explore suggested chain ------------------------------------------------
multiple_chains %>%
# use data.frame output from <epichains> object
as_tibble() %>%
filter(chain == chain_to_observe) %>%
print(n = Inf)
# visualize ---------------------------------------------------------------
# daily aggregate of cases
aggregate_chains <- multiple_chains %>%
as_tibble() %>%
# count the daily number of cases in each chain
mutate(day = ceiling(time)) %>%
count(chain, day, name = "cases") %>%
# calculate the cumulative number of cases for each chain
group_by(chain) %>%
mutate(cumulative_cases = cumsum(cases)) %>%
ungroup()
# Visualize transmission chains by cumulative cases
aggregate_chains %>%
# create grouped chain trajectories
ggplot(aes(x = day, y = cumulative_cases, group = chain)) +
geom_line(color = "black", alpha = 0.25, show.legend = FALSE) +
# define a 100-case threshold
geom_hline(aes(yintercept = 100), lty = 2) +
labs(x = "Day", y = "Cumulative cases")
# count chains over 100 cases
aggregate_chains %>%
filter(cumulative_cases >= 100) %>%
count(chain)
# distribution of size of chains
aggregate_chains %>%
filter(cumulative_cases >= 100) %>%
skimr::skim(cumulative_cases)
# distribution of lenght of chains
aggregate_chains %>%
filter(cumulative_cases >= 100) %>%
skimr::skim(day)
Outputs
Group 1
contact network | secondary cases | simulated chains |
---|---|---|
![]() |
![]() |
![]() |
Group 2
contact network | secondary cases | simulated chains |
---|---|---|
![]() |
![]() |
![]() |
Group 3
contact network | secondary cases | simulated chains |
---|---|---|
![]() |
![]() |
![]() |
Sample
R
# infector-infectee data frame
simulated_chains_map %>%
dplyr::filter(simulation_id == 806) %>%
dplyr::as_tibble()
Interpretation
Interpretation template:
- Simulation
806
have1
chain with3
known infectors (NA
, 1, 2), and3
generations. - In the generation 0, subject
NA
infected subject 1. - In the generation 1, subject 1 infected subjects 2, 3, 4, 5, 6. These infections occurred between day 10 and 16 after the “case zero”.
- In the generation 2, subject 2 infected subjects 7, 8, 9. These infections occurred between day 26 and 29 after the “case zero”.
Interpretation Helpers:
From the plot of cumulative cases by day for each simulated chain:
Group | Parameters | Number of Chains Above 100 | Max Chain Size | Max Chain Length |
---|---|---|---|---|
1 | R = 0.8, k = 0.01 | 10 | ~200 | ~20 days |
2 | R = 0.8, k = 0.1 | 8 | ~420 | ~60 days |
3 | R = 0.8, k = 0.5 | 3 | ~180 | ~70 days |
4 | R = 1.5, k = 0.01 | 16 | ~840 | ~20 days |
5 | R = 1.5, k = 0.1 | 65 | ~890 | ~50 days |
6 | R = 1.5, k = 0.5 | 216 | ~850 | ~90 days |