Content from Read case data
Last updated on 2024-04-29
Overview
Questions
- Where do you usually store your outbreak data?
- How many different data formats can you read?
- Is it possible to import data from databases and health APIs?
Objectives
- Explain how to import outbreak data from different sources into the R environment for analysis.
Introduction
The initial step in outbreak analysis involves importing the target dataset into the R environment from various sources. Outbreak data is typically stored in files of diverse formats, relational database management systems (RDBMS), or health information system (HIS) application programming interfaces (APIs) such as REDCap and DHIS2. The latter option is particularly well-suited for storing institutional health data. This episode explains the process of reading case data from these sources.
Reading from files
Several packages are available for importing outbreak data stored in individual files into R. These include rio, readr from the tidyverse, io, ImportExport, and data.table. Together, these packages offer methods to read single or multiple files in a wide range of formats.
The example below shows how to import a csv file into the R environment using the rio package.
R
library("rio")
library("here")
# read data
# e.g.: if path to file is data/raw-data/ebola_cases.csv then:
ebola_confirmed <- rio::import(here::here("data", "raw-data", "ebola_cases.csv"))
# preview data
head(ebola_confirmed, 5)
OUTPUT
date confirm
1 2014-05-18 1
2 2014-05-20 2
3 2014-05-21 4
4 2014-05-22 6
5 2014-05-23 1
Similarly, you can import files of other formats such as tsv, xlsx, etc.
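As a quick, self-contained sketch of this (it first writes a small example table to a temporary tsv file, since no tsv file ships with the lesson), importing a tab-separated file with rio looks just like the csv case:

```r
library("rio")

# write a small example table to a temporary tsv file
tmp_tsv <- tempfile(fileext = ".tsv")
write.table(
  data.frame(date = c("2014-05-18", "2014-05-20"), confirm = c(1, 2)),
  file = tmp_tsv, sep = "\t", row.names = FALSE, quote = FALSE
)

# rio::import() picks the appropriate reader from the file extension
ebola_tsv <- rio::import(tmp_tsv)
head(ebola_tsv)
```

The same call works for xlsx and the other formats that rio supports.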
You can check the full list of supported file formats on the rio package website. To install support for the full range of formats, run:
R
rio::install_formats()
R
rio::import(here::here("some", "where", "downto", "path", "file_name.zip"))
Download the zip file containing data for the Marburg outbreak and then import it into your working environment.
Reading from databases
The DBI package serves as a versatile interface for interacting with database management systems (DBMS) across different back-ends or servers. It offers a uniform method for accessing and retrieving data from various database systems.
The following code chunk demonstrates how to create a temporary SQLite database in memory, store the case_data as a table within it, and subsequently read from it:
R
library("DBI")
library("RSQLite")
# Create a temporary SQLite database in memory
db_con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
# Store the 'case_data' dataframe as a table named 'cases'
# in the SQLite database ('case_data' is a data frame of case data,
# e.g. the Ebola data imported earlier)
DBI::dbWriteTable(db_con, "cases", case_data)
# Read data from the 'cases' table
result <- DBI::dbReadTable(db_con, "cases")
# Close the database connection
DBI::dbDisconnect(db_con)
# View the result
base::print(utils::head(result))
OUTPUT
date confirm
1 16208 1
2 16210 2
3 16211 4
4 16212 6
5 16213 1
6 16214 2
This code first establishes a connection to an SQLite database created in memory using dbConnect(). Then, it writes the case_data into a table named ‘cases’ within the database using the dbWriteTable() function. Subsequently, it reads the data from the ‘cases’ table using dbReadTable(). Finally, it closes the database connection with dbDisconnect(). Read this tutorial episode on SQL databases and R for more examples.
Run SQL queries in R using dbplyr
A database interface package optimizes memory usage by processing the data inside the database before extraction, reducing the memory load. Conversely, conducting all data manipulation outside the database (e.g., in our local RStudio session) can lead to inefficient memory usage and strained system resources.
Read the Introduction to dbplyr vignette to learn how to generate your own queries!
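As a minimal sketch of this idea (assuming the {dplyr}, {dbplyr}, and {RSQLite} packages are installed), you can build a query with dplyr verbs that is translated to SQL and executed inside the database, so that collect() pulls only the filtered result into R:

```r
library("DBI")
library("RSQLite")
library("dplyr")

# create a temporary in-memory SQLite database with a small example table
db_con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
DBI::dbWriteTable(
  db_con, "cases",
  data.frame(date = c("2014-05-18", "2014-05-20", "2014-05-21"),
             confirm = c(1, 2, 4))
)

# refer to the remote table; dplyr verbs on it are translated to SQL by dbplyr
cases_db <- dplyr::tbl(db_con, "cases")

# the filtering runs inside the database; collect() brings only the result into R
result <- cases_db |>
  dplyr::filter(confirm >= 2) |>
  dplyr::collect()

DBI::dbDisconnect(db_con)
```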
Reading from HIS APIs
Health related data are also increasingly stored in specialized HIS
APIs like Fingertips, GoData,
REDCap, and DHIS2. In such case one
can resort to readepi package,
which enables reading data from HIS-APIs.
Content from Clean and validate
Last updated on 2024-04-29
Overview
Questions
- How to clean and standardize case data?
- How to convert a raw dataset into a linelist object?
Objectives
- Explain how to clean, curate, and standardize case data using the {cleanepi} package
- Demonstrate how to convert case data to linelist data
Introduction
In the process of analyzing outbreak data, it’s essential to ensure that the dataset is clean, curated, standardized, and validated to facilitate accurate and reproducible analysis. This episode focuses on cleaning epidemic and outbreak data using the cleanepi package, and validating it using the linelist package. For demonstration purposes, we’ll work with a simulated dataset of Ebola cases.
The first step is to import the dataset following the guidelines outlined in the Read case data episode. This involves loading the dataset into our environment and viewing its structure and content.
R
# Load packages
library("rio")
library("here")
# Read data
# e.g.: if path to file is data/raw-data/simulated_ebola_2.csv then:
raw_ebola_data <- rio::import(
here::here("data", "raw-data", "simulated_ebola_2.csv")
)
R
# Return first five rows
utils::head(raw_ebola_data, 5)
OUTPUT
V1 case id age gender status date onset date sample
1 1 14905 90 1 confirmed 03/15/2015 06/04/2015
2 2 13043 twenty-five 2 Sep /11/Y 03/01/2014
3 3 14364 54 f <NA> 09/02/2014 03/03/2015
4 4 14675 ninety <NA> 10/19/2014 31/ 12 /14
5 5 12648 74 F 08/06/2014 10/10/2016
A quick inspection
Quick exploration and inspection of the dataset are crucial before diving into any analysis tasks. The {cleanepi} package simplifies this process with the scan_data() function. Let’s take a look at how you can use it:
R
library("cleanepi")
cleanepi::scan_data(raw_ebola_data)
OUTPUT
Field_names missing numeric date character logical
1 V1 0.000000 1.0000 0.000000 0.000000 0
2 case id 0.000000 1.0000 0.000000 0.000000 0
3 age 0.064600 0.8348 0.000000 0.100600 0
4 gender 0.157867 0.0472 0.000000 0.794933 0
5 status 0.053533 0.0000 0.000000 0.946467 0
6 date onset 0.000067 0.0000 0.915733 0.084200 0
7 date sample 0.000133 0.0000 0.999867 0.000000 0
The results provide an overview of the content of every column, including the column names and the proportion of each data type per column. You can see that the column names in the dataset are descriptive but lack consistency, as some are composed of multiple words separated by white spaces. Additionally, some columns contain more than one data type, and there are missing values in others.
Common operations
This section demonstrates how to perform some common data cleaning operations using the {cleanepi} package.
Standardizing column names
For this example dataset, standardizing column names typically involves removing spaces and connecting different words with “_”. This practice helps maintain consistency and readability in the dataset. However, the function used for standardizing column names offers more options. Type ?cleanepi::standardize_column_names for more details.
R
sim_ebola_data <- cleanepi::standardize_column_names(raw_ebola_data)
names(sim_ebola_data)
OUTPUT
[1] "v_1" "case_id" "age" "gender" "status"
[6] "date_onset" "date_sample"
If you want to maintain certain column names without subjecting them to the standardization process, you can utilize the keep parameter of the standardize_column_names() function. This parameter accepts a vector of column names that are intended to be kept unchanged.
Exercise: Standardize the column names of the input dataset, but keep the “V1” column as is.
Removing irregularities
Raw data may contain irregularities such as duplicated and empty rows and columns, as well as constant columns. The remove_duplicates() and remove_constant() functions from {cleanepi} remove such irregularities, as demonstrated in the code chunk below.
R
sim_ebola_data <- cleanepi::remove_constant(sim_ebola_data)
sim_ebola_data <- cleanepi::remove_duplicates(sim_ebola_data)
Note that our simulated Ebola dataset does not contain duplicated or constant rows or columns.
Replacing missing values
In addition to these irregularities, raw data can contain missing values that may be encoded by different strings, including the empty string. To ensure robust analysis, it is good practice to replace all missing values with NA in the entire dataset. Below is a code snippet demonstrating how you can achieve this in {cleanepi}:
R
sim_ebola_data <- cleanepi::replace_missing_values(sim_ebola_data)
Validating subject IDs
Each entry in the dataset represents a subject and should be distinguishable by a specific column formatted in a particular way, such as falling within a specified range, containing certain prefixes and/or suffixes, or containing a specific number of characters. The {cleanepi} package offers the check_subject_ids() function designed precisely for this task, as shown in the code chunk below. This function validates whether the subject IDs are unique and meet the required criteria.
R
# remove this code chunk once {cleanepi} is updated.
# The coercion made here will be accounted for within {cleanepi}
sim_ebola_data$case_id <- as.character(sim_ebola_data$case_id)
R
sim_ebola_data <- cleanepi::check_subject_ids(sim_ebola_data,
target_columns = "case_id",
range = c(0, 15000)
)
OUTPUT
Found 1957 duplicated rows. Please consult the report for more details.
Note that our simulated dataset does contain duplicated subject IDs.
Standardizing dates
An epidemic dataset typically contains date columns for different events, such as the date of infection and the date of symptom onset. These dates can come in different formats, and it is good practice to standardize them. The {cleanepi} package provides functionality for converting date columns in epidemic datasets into ISO format, ensuring consistency across the different date columns. Here’s how you can use it on our simulated dataset:
R
sim_ebola_data <- cleanepi::standardize_dates(
sim_ebola_data,
target_columns = c(
"date_onset",
"date_sample"
)
)
utils::head(sim_ebola_data)
OUTPUT
v_1 case_id age gender status date_onset date_sample
1 1 14905 90 1 confirmed 2015-03-15 2015-04-06
2 2 13043 twenty-five 2 <NA> <NA> 2014-01-03
3 3 14364 54 f <NA> 2014-02-09 2015-03-03
4 4 14675 ninety <NA> <NA> 2014-10-19 2014-12-31
5 5 12648 74 F <NA> 2014-06-08 2016-10-10
6 6 14274 seventy-six female <NA> <NA> 2016-01-23
This function converts the values in the target columns, or, if target_columns = NULL, automatically detects the date columns within the dataset and converts them into the Ymd format.
Converting to numeric values
In a raw dataset, some columns can come with a mixture of character and numeric values, and you may want to convert the character values explicitly into numeric. For example, in our simulated dataset, some entries in the age column are written in words. The convert_to_numeric() function in {cleanepi} does such conversion, as illustrated in the code chunk below.
R
sim_ebola_data <- cleanepi::convert_to_numeric(sim_ebola_data,
target_columns = "age"
)
utils::head(sim_ebola_data)
OUTPUT
v_1 case_id age gender status date_onset date_sample
1 1 14905 90 1 confirmed 2015-03-15 2015-04-06
2 2 13043 25 2 <NA> <NA> 2014-01-03
3 3 14364 54 f <NA> 2014-02-09 2015-03-03
4 4 14675 90 <NA> <NA> 2014-10-19 2014-12-31
5 5 12648 74 F <NA> 2014-06-08 2016-10-10
6 6 14274 76 female <NA> <NA> 2016-01-23
Multiple operations at once
Performing data cleaning operations individually can be time-consuming and error-prone. The {cleanepi} package simplifies this process by offering a convenient wrapper function called clean_data(), which applies a series of predefined data cleaning operations to the input dataset and allows you to perform multiple operations at once.
Furthermore, you can combine multiple data cleaning tasks via the pipe operator “|>”, as shown in the code snippet below.
R
# remove the line below once Karim has updated cleanepi
raw_ebola_data$`case id` <- as.character(raw_ebola_data$`case id`)
# PERFORM THE OPERATIONS USING THE pipe SYNTAX
cleaned_data <- raw_ebola_data |>
cleanepi::standardize_column_names(keep = "V1", rename = NULL) |>
cleanepi::replace_missing_values(target_columns = NULL) |>
cleanepi::remove_constant(cutoff = 1.0) |>
cleanepi::remove_duplicates(target_columns = NULL) |>
cleanepi::standardize_dates(
target_columns = c("date_onset", "date_sample"),
error_tolerance = 0.4,
format = NULL,
timeframe = NULL
) |>
cleanepi::check_subject_ids(
target_columns = "case_id",
range = c(1, 15000)
) |>
cleanepi::convert_to_numeric(target_columns = "age") |>
cleanepi::clean_using_dictionary(dictionary = test_dict) # 'test_dict': a cleaning dictionary defined elsewhere in the lesson
OUTPUT
Found 1957 duplicated rows. Please consult the report for more details.
Printing the clean report
The {cleanepi} package generates a comprehensive report detailing the findings and actions of all data cleansing operations conducted during the analysis. This report is presented as a webpage with multiple sections. Each section corresponds to a specific data cleansing operation, and clicking on each section allows you to access the results of that particular operation. This interactive approach enables users to efficiently review and analyze the outcomes of individual cleansing steps within the broader data cleansing process.
You can view the report using the cleanepi::print_report() function.
Validating and tagging case data
In outbreak analysis, once you have completed the initial steps of reading and cleaning the case data, it’s essential to establish an additional foundational layer to ensure the integrity and reliability of subsequent analyses. Specifically, this involves verifying the presence and correct data type of certain input columns within your dataset, a process commonly referred to as “tagging.” Additionally, it’s crucial to implement measures to validate that these tagged columns are not inadvertently deleted during further data processing steps.
This is achieved by converting the cleaned case data into a linelist object using the linelist package; see the code chunk below.
R
library("linelist")
data <- linelist::make_linelist(cleaned_data,
id = "case_id",
age = "age",
date_onset = "date_onset",
date_reporting = "date_sample",
gender = "gender"
)
utils::head(data, 7)
OUTPUT
// linelist object
V1 case_id age gender status date_onset date_sample
1 1 14905 90 male confirmed 2015-03-15 2015-04-06
2 2 13043 25 female <NA> <NA> 2014-01-03
3 3 14364 54 female <NA> 2014-02-09 2015-03-03
4 4 14675 90 <NA> <NA> 2014-10-19 2014-12-31
5 5 12648 74 female <NA> 2014-06-08 2016-10-10
6 6 14274 76 female <NA> <NA> 2016-01-23
7 7 14132 16 male confirmed <NA> 2015-10-05
// tags: id:case_id, date_onset:date_onset, date_reporting:date_sample, gender:gender, age:age
Key Points
- Use the {cleanepi} package to clean and standardize epidemic and outbreak data
- Use linelist to tag, validate, and prepare case data for downstream analysis.
Content from Aggregate and visualize
Last updated on 2024-04-29
Overview
Questions
- How to aggregate case data?
- How to visualize aggregated data?
- What is the distribution of cases in time, place, gender, and age?
Objectives
- Simulate synthetic outbreak data
- Convert linelist data to incidence
- Create epidemic curves from incidence data
Introduction
In an analytic pipeline, exploratory data analysis (EDA) is an important step before formal modelling. EDA helps determine relationships between variables and summarize their main characteristics often by means of data visualization.
This episode focuses on EDA of outbreak and epidemic data, and how to achieve that using a couple of handy R packages. A key observation in EDA of epidemic data is capturing the relationship between time and the number of reported cases, spanning various categories (confirmed, hospitalized, deaths, and recoveries), locations, and other demographic factors such as gender, age, etc.
Synthetic outbreak data
To illustrate the process of conducting EDA on outbreak data, we will generate a line list for a hypothetical Ebola outbreak utilizing the {simulist} package. This line list dataset offers individual-level information about the outbreak. For our simulation, we will assume that the dynamics of this outbreak are influenced by several factors: the contact distribution (average number of contacts for an infected case), the distribution of contact intervals (time period between contacts), and the delay distributions of onset to hospitalization and onset to death. The latter distributions can be sourced from the literature and are conveniently available in the {epiparameter} package; see the code chunk below.
R
# Load simulist and epiparameter packages
library("simulist")
library("epiparameter")
# Define contact distribution
contact_dist <- epiparameter::epidist(
disease = "Ebola",
epi_dist = "contact distribution",
prob_distribution = "pois",
prob_distribution_params = c(mean = 2)
)
# Define distribution for interval between contact
cont_interval <- epiparameter::epidist(
disease = "Ebola",
epi_dist = "contact interval",
prob_distribution = "gamma",
prob_distribution_params = c(shape = 1, scale = 1)
)
# Define onset to hospitalized distribution
onset_to_hosp <- epiparameter::epidist(
disease = "Ebola",
epi_dist = "onset to hospitalisation",
prob_distribution = "pois",
prob_distribution_params = c(mean = 7)
)
# get onset to death from {epiparameter} database
onset_to_death <- epiparameter::epidist_db(
disease = "Ebola",
epi_dist = "onset to death",
single_epidist = TRUE
)
# Define distribution for infectious period
infect_period <- epiparameter::epidist(
disease = "Ebola",
epi_dist = "Infectious period",
prob_distribution = "gamma",
prob_distribution_params = c(shape = 1, scale = 1)
)
Additionally, we assume that the outbreak started at the beginning of 2023, is highly contagious with a probability of infection of \(60\%\), and its minimum and maximum sizes are 1000 and 10,000, respectively. Combining these assumptions with the mentioned distributions, the code chunk below generates a simulated line list:
R
# Set seed to 1 to have the same results
base::set.seed(1)
# Generate simulation data using the defined distribution.
linelist <- simulist::sim_linelist(
contact_dist,
infect_period,
prob_infect = 0.6,
onset_to_hosp,
onset_to_death,
hosp_risk = 0.2,
hosp_death_risk = 0.5,
non_hosp_death_risk = 0.05,
outbreak_start_date = as.Date("2023-01-01"),
add_names = TRUE,
add_ct = TRUE,
outbreak_size = c(1000, 10000),
population_age = c(1, 90),
case_type_probs = c(suspected = 0.2, probable = 0.1, confirmed = 0.7),
config = simulist::create_config()
)
# View first few rows of the generated data
utils::head(linelist)
OUTPUT
id case_name case_type sex age date_onset date_admission
1 1 Keegan Hardy-Roy confirmed m 3 2023-01-01 <NA>
2 6 Alexandria Torres-Perez probable f 13 2023-01-01 <NA>
3 7 Mitchell Reinhart confirmed m 74 2023-01-01 2023-01-08
4 9 Stephanie Loadman-Copeland confirmed f 65 2023-01-01 2023-01-10
5 10 Sufyaan al-Irani confirmed m 8 2023-01-02 <NA>
6 11 Rohan Nguyen confirmed m 27 2023-01-01 <NA>
date_death date_first_contact date_last_contact ct_value
1 <NA> <NA> <NA> 24.9
2 <NA> 2022-12-30 2023-01-02 NA
3 <NA> 2022-12-31 2023-01-04 24.9
4 <NA> 2023-01-03 2023-01-04 24.9
5 <NA> 2022-12-31 2023-01-02 24.9
6 <NA> 2023-01-01 2023-01-05 24.9
Aggregating
Downstream analysis involves working with aggregated data rather than individual cases. This requires grouping linelist data in the form of incidence data. The incidence2 package offers an essential function, called incidence(), for grouping case data, usually centered around dated events and/or other factors. The code chunk provided below demonstrates the creation of an incidence2 object from the simulated Ebola linelist data based on the date of onset.
R
# load incidence2 package
library("incidence2")
# create incidence object by aggregating case data based on the date of onset
dialy_incidence_data <- incidence2::incidence(
linelist,
date_index = "date_onset",
interval = 1
)
# View the first incidence data for the first 5 days
utils::head(dialy_incidence_data, 5)
OUTPUT
# incidence: 5 x 3
# count vars: date_onset
date_index count_variable count
* <period> <chr> <int>
1 2023-01-01 date_onset 475
2 2023-01-02 date_onset 4904
3 2023-01-03 date_onset 5478
4 2023-01-04 date_onset 5319
5 2023-01-05 date_onset 3520
Furthermore, with the incidence2 package, you can specify the desired interval and categorize cases by one or more factors. Below is a code snippet demonstrating weekly cases grouped by the date of onset and gender.
R
# Grouping data by week
weekly_incidence_data <- incidence2::incidence(
linelist,
date_index = "date_onset",
interval = 7,
groups = c("sex", "case_type")
)
# View incidence data for the first 5 weeks
utils::head(weekly_incidence_data, 5)
OUTPUT
# incidence: 5 x 5
# count vars: date_onset
# groups: sex, case_type
date_index sex case_type count_variable count
* <period> <chr> <chr> <chr> <int>
1 2022-12-29 to 2023-01-04 f confirmed date_onset 5576
2 2022-12-29 to 2023-01-04 f probable date_onset 766
3 2022-12-29 to 2023-01-04 f suspected date_onset 1612
4 2022-12-29 to 2023-01-04 m confirmed date_onset 5760
5 2022-12-29 to 2023-01-04 m probable date_onset 787
Notes
When cases are grouped by different factors, it’s possible that these groups may have different date ranges in the resulting incidence2 object. The incidence2 package provides a function called complete_dates() to ensure that an incidence object has the same range of dates for each group. By default, missing counts will be filled with 0.
R
# Create incidence object
dialy_incidence_data_2 <- incidence2::incidence(
linelist,
date_index = "date_onset",
groups = "sex",
interval = 1
)
# Complete missing dates in the incidence object
incidence2::complete_dates(dialy_incidence_data_2,
expand = TRUE,
fill = 0L, by = 1L,
allow_POSIXct = FALSE
)
OUTPUT
# incidence: 24 x 4
# count vars: date_onset
# groups: sex
date_index sex count_variable count
<period> <chr> <chr> <int>
1 2023-01-01 f date_onset 242
2 2023-01-01 m date_onset 233
3 2023-01-02 f date_onset 2403
4 2023-01-02 m date_onset 2501
5 2023-01-03 f date_onset 2709
6 2023-01-03 m date_onset 2769
7 2023-01-04 f date_onset 2600
8 2023-01-04 m date_onset 2719
9 2023-01-05 f date_onset 1763
10 2023-01-05 m date_onset 1757
# ℹ 14 more rows
Visualization
The incidence2 object can be visualized using the plot() function from the base R package. The resulting graph is referred to as an epidemic curve, or epi-curve for short. The following code snippets generate epi-curves for the dialy_incidence_data and weekly_incidence_data incidence objects mentioned above.
R
# Load ggplot2 and tracetheme packages
library("ggplot2")
library("tracetheme")
# Plot daily incidence data
base::plot(dialy_incidence_data) + ggplot2::labs(
x = "Time (in days)",
y = "Daily cases"
) + tracetheme::theme_trace()
R
# Plot weekly incidence data
base::plot(weekly_incidence_data) + ggplot2::labs(
x = "Time (in weeks)",
y = "Weekly cases"
) + tracetheme::theme_trace()
Challenge 1: Can you do it?
- Using suitable distributions for contacts, contact interval, infectious period, onset to hospitalization, and onset to death, generate simulated linelist data for a Marburg outbreak that has an infection probability of \(0.5\).
- Aggregate the generated linelist and produce some epidemic curves.
Key Points
- Use the {simulist} package to generate synthetic outbreak data
- Use the incidence2 package to aggregate case data based on a dated event, and produce epidemic curves.
Content from Simple analysis
Last updated on 2024-04-29
Overview
Questions
- What is the growth rate of the epidemic?
- How to identify the peak time of an outbreak?
- How to compute moving average of cases?
Objectives
- Perform an early analysis of outbreak data
- Identify trends, exponential, doubling, and peak time
Introduction
Understanding the trend in case data is crucial for various purposes, such as forecasting future case counts, implementing public health interventions, and assessing the effectiveness of control measures. By analyzing the trend, policymakers and public health experts can make informed decisions to mitigate the spread of diseases and protect public health. This episode focuses on how to perform a simple early analysis on incidence data. It uses the same COVID-19 case data from England that was utilized in the Aggregate and visualize episode.
Simple model
Aggregated case data over a specific time unit, or incidence data, typically represent the number of cases occurring within that time frame. These data can often be assumed to follow either a Poisson distribution or a negative binomial (NB) distribution, depending on the specific characteristics of the data and the underlying processes generating them. When analyzing such data, one common approach is to examine the trend over time by computing the rate of change, which can indicate whether there is exponential growth or decay in the number of cases. Exponential growth implies that the number of cases is increasing at an accelerating rate over time, while exponential decay suggests that the number of cases is decreasing at a decelerating rate.
The i2extras package provides methods for modelling the trend in case data, calculating moving averages, and estimating exponential growth or decay rates. The code chunk below computes the COVID-19 trend in England within the first 3 months using the negative binomial distribution.
R
# load the i2extras package, which provides methods for modelling
library("i2extras")
# load the incidence2 package, used to compute incidence from case data
library("incidence2")
# subset the covid19_eng_case_data to include only the first 3 months of data
covid19_eng_case_data <- outbreaks::covid19_england_nhscalls_2020
df <- base::subset(
covid19_eng_case_data,
covid19_eng_case_data$date <= min(covid19_eng_case_data$date) + 90
)
# uses the incidence function from the incidence2 package to compute the
# incidence data
df_incid <- incidence2::incidence(df, date_index = "date", groups = "sex")
# fit a curve to the incidence data. The model chosen is the negative binomial
# distribution with a significance level (alpha) of 0.05.
fitted_curve_nb <- i2extras::fit_curve(df_incid, model = "negbin", alpha = 0.05)
base::plot(fitted_curve_nb, angle = 45) + ggplot2::labs(x = "Date", y = "Cases")
R
fitted_curve_poisson <- i2extras::fit_curve(df_incid, model = "poisson",
alpha = 0.05)
base::plot(fitted_curve_poisson, angle = 45) +
ggplot2::labs(x = "Date", y = "Cases")
Exponential growth or decay rate
The exponential growth or decay rate, denoted as \(r\), serves as an indicator for the trend in cases, indicating whether they are increasing (growth) or decreasing (decay) on an exponential scale. This rate is computed using the so-called renewal equation (Wallinga et al. 2006), which mechanistically links the reproductive number \(R\) of new cases to the generation interval of the disease. This computational method is implemented in the i2extras package.
Below is a code snippet demonstrating how to extract the growth/decay rate from the above NB-fitted curve using the growth_rate() function:
R
rates_nb <- i2extras::growth_rate(fitted_curve_nb)
rates_nb <- base::as.data.frame(rates_nb) |>
subset(select = c(sex, r, r_lower, r_upper))
base::print(rates_nb)
OUTPUT
sex r r_lower r_upper
1 female -0.008241228 -0.009182635 -0.007300403
2 male -0.008346783 -0.009316775 -0.007377392
3 unknown -0.023703987 -0.028179436 -0.019299926
Peak time
The peak time is the time at which the highest number of cases is observed in the aggregated data. It can be estimated using the i2extras::estimate_peak() function, as shown in the code chunk below, which identifies the peak time from the incidence2 object df_incid.
R
peaks_nb <- i2extras::estimate_peak(df_incid, progress = FALSE) |>
subset(select = -c(count_variable, bootstrap_peaks))
base::print(peaks_nb)
OUTPUT
# A data frame: 3 × 6
sex observed_peak observed_count lower_ci median upper_ci
* <chr> <date> <int> <date> <date> <date>
1 female 2020-03-26 1314 2020-03-18 2020-03-23 2020-03-29
2 male 2020-03-27 1299 2020-03-18 2020-03-23 2020-03-30
3 unknown 2020-04-10 32 2020-03-24 2020-04-10 2020-04-16
Moving average
A moving or rolling average calculates the average number of cases within a specified time period. This can be achieved by utilizing the add_rolling_average() function from the i2extras package on an incidence2 object. The following code chunk demonstrates the computation of the weekly average number of cases from the incidence2 object df_incid, followed by visualization.
R
library("ggplot2")
moving_Avg_week <- i2extras::add_rolling_average(df_incid, n = 7L)
base::plot(moving_Avg_week, border_colour = "white", angle = 45) +
ggplot2::geom_line(ggplot2::aes(x = date_index, y = rolling_average,
color = "red")) +
ggplot2::labs(x = "Date", y = "Cases")
R
moving_Avg_month <- i2extras::add_rolling_average(df_incid, n = 30L)
base::plot(moving_Avg_month, border_colour = "white", angle = 45) +
ggplot2::geom_line(ggplot2::aes(x = date_index, y = rolling_average,
color = "red")) +
ggplot2::labs(x = "Date", y = "Cases")
Key Points
- Use i2extras to:
  - fit an epi-curve using either Poisson or NB distributions,
  - calculate the exponential growth or decline of cases,
  - find the peak time, and
  - compute the moving average of cases in a specified time window.
Content from Access epidemiological delay distributions
Last updated on 2024-05-04
Overview
Questions
- How to get access to disease delay distributions from a pre-established database for use in analysis?
Objectives
- Get delays from a literature search database with {epiparameter}.
- Get distribution parameters and summary statistics of delay distributions.
Introduction
Infectious diseases follow an infection cycle, which usually includes the following phases: presymptomatic period, symptomatic period and recovery period, as described by their natural history. These time periods can be used to understand transmission dynamics and inform disease prevention and control interventions.
Definitions
Look at the glossary for the definitions of all the time periods of the figure above!
However, early in an epidemic, modelling efforts can be delayed by the lack of a centralised resource that summarises input parameters for the disease of interest (Nash et al., 2023). Projects like {epiparameter} and {epireview} are building online catalogues following literature synthesis protocols that can help parametrise models by easily accessing a comprehensive library of previously estimated epidemiological parameters from past outbreaks.
To exemplify how to use the {epiparameter} R package in your analysis pipeline, our goal in this episode will be to choose one specific set of epidemiological parameters from the literature, instead of copying and pasting them by hand, and plug them into an EpiNow2 analysis workflow.
Let’s start by loading the {epiparameter} package. We’ll use the pipe %>% to connect some of its functions with some tibble and dplyr functions, so let’s also call the tidyverse package:
R
library(epiparameter)
library(tidyverse)
The problem
If we want to estimate the transmissibility of an infection, it’s common to use a package such as EpiEstim or EpiNow2. However, both require some epidemiological information as an input. For example, in EpiNow2 we use EpiNow2::dist_spec() to specify a generation time as a probability distribution, adding its mean, standard deviation (sd), and maximum value (max). To specify a generation_time that follows a Gamma distribution with mean \(\mu = 4\), standard deviation \(\sigma = 2\), and a maximum value of 20, we write:
R
generation_time <-
EpiNow2::dist_spec(
mean = 4,
sd = 2,
max = 20,
distribution = "gamma"
)
It is common practice for analysts to manually search the available literature and copy and paste the summary statistics or the distribution parameters from scientific publications. A challenge that is often faced is that the reporting of different statistical distributions is not consistent across the literature. The objective of {epiparameter} is to facilitate access to reliable estimates of distribution parameters for a range of infectious diseases, so that they can easily be implemented in outbreak analytic pipelines.
In this episode, we will choose the summary statistics from the library of epidemiological parameters provided by {epiparameter}.
Generation time vs serial interval
The generation time, jointly with the reproduction number (\(R\)), provides valuable insight into the strength of transmission and informs the implementation of control measures. Given \(R>1\), the shorter the generation time, the earlier the incidence of disease cases will grow.
In calculating the effective reproduction number (\(R_{t}\)), the generation time distribution is often approximated by the serial interval distribution. This frequent approximation is because it is easier to observe and measure the onset of symptoms than the onset of infectiousness.
However, using the serial interval as an approximation of the generation time is primarily valid for diseases in which infectiousness starts after symptom onset (Chung Lau et al., 2021). In cases where infectiousness starts before symptom onset, the serial intervals can have negative values, which is the case for diseases with pre-symptomatic transmission (Nishiura et al., 2020).
From time periods to probability distributions.
When we calculate the serial interval, we see that not all case pairs have the same time length. We will observe this variability for any case pair and individual time period, including the incubation period and infectious period.
To summarise these data from individual and pair time periods, we can find the statistical distributions that best fit the data (McFarland et al., 2023).
Statistical distributions are summarised in terms of their summary statistics like the location (mean and percentiles) and spread (variance or standard deviation) of the distribution, or with their distribution parameters that inform about the form (shape and rate/scale) of the distribution. These estimated values can be reported with their uncertainty (95% confidence intervals).
| Gamma | mean | shape | rate/scale |
|---|---|---|---|
| MERS-CoV | 14.13 (13.9–14.7) | 6.31 (4.88–8.52) | 0.43 (0.33–0.60) |
| COVID-19 | 5.1 (5.0–5.5) | 2.77 (2.09–3.88) | 0.53 (0.38–0.76) |

| Weibull | mean | shape | rate/scale |
|---|---|---|---|
| MERS-CoV | 14.2 (13.3–15.2) | 3.07 (2.64–3.63) | 16.1 (15.0–17.1) |
| COVID-19 | 5.2 (4.6–5.9) | 1.74 (1.46–2.11) | 5.83 (5.08–6.67) |

| Log normal | mean | mean-log | sd-log |
|---|---|---|---|
| MERS-CoV | 14.08 (13.1–15.2) | 2.58 (2.50–2.68) | 0.44 (0.39–0.5) |
| COVID-19 | 5.2 (4.2–6.5) | 1.45 (1.31–1.61) | 0.63 (0.54–0.74) |
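As a quick consistency check on the tables above (a base R sketch, using values copied from the COVID-19 rows), the reported means follow from the distribution parameters in closed form:

```r
# gamma: mean = shape / rate
2.77 / 0.53 # COVID-19, ~5.2 (reported mean: 5.1)

# log normal: mean = exp(meanlog + sdlog^2 / 2)
exp(1.45 + 0.63^2 / 2) # COVID-19, ~5.2 (reported mean: 5.2)
```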
The peak of each curve can inform you about the location of the mean of each distribution. The larger the mean, the larger the serial interval.
Which one would be harder to control?
COVID-19
Why do you conclude that?
COVID-19 has the lowest mean serial interval. The approximate mean serial interval of COVID-19 is around five days, while for MERS-CoV it is about fourteen days. Thus, COVID-19 will likely produce new generations of cases in less time than MERS-CoV, assuming similar reproduction numbers.
Choosing epidemiological parameters
In this section, we will use {epiparameter}
to obtain
the generation time and the serial interval for COVID-19, so these
metrics can be used to estimate the transmissibility of this disease
using EpiNow2 in subsequent sections of this episode.
Let's start by looking at how many entries are available in the epidemiological distributions database in {epiparameter} (epidist_db) for the disease named covid-19:
R
epiparameter::epidist_db(
disease = "covid"
)
OUTPUT
Returning 27 results that match the criteria (22 are parameterised).
Use subset to filter by entry variables or single_epidist to return a single entry.
To retrieve the short citation for each use the 'get_citation' function
OUTPUT
List of <epidist> objects
Number of entries in library: 27
Number of studies in library: 10
Number of diseases: 1
Number of delay distributions: 27
Number of offspring distributions: 0
The double-colon
The double-colon :: in R is used to access functions or objects from a specific package without loading the entire package into the current environment. This allows for a more targeted approach to using package components and helps avoid namespace conflicts.
:: lets you call a specific function from a package by explicitly mentioning the package name. For example, dplyr::filter(data, condition) uses filter() from the dplyr package without loading the entire package.
From the {epiparameter} package, we can use the epidist_db() function to ask for any disease and also for a specific epidemiological distribution (epi_dist).
Let's now ask how many parameters we have in the epidemiological distributions database (epidist_db) for the generation time, using the string generation:
R
epiparameter::epidist_db(
epi_dist = "generation"
)
OUTPUT
Returning 1 results that match the criteria (1 are parameterised).
Use subset to filter by entry variables or single_epidist to return a single entry.
To retrieve the short citation for each use the 'get_citation' function
OUTPUT
Disease: Influenza
Pathogen: Influenza-A-H1N1
Epi Distribution: generation time
Study: Lessler J, Reich N, Cummings D, New York City Department of Health and
Mental Hygiene Swine Influenza Investigation Team (2009). "Outbreak of
2009 Pandemic Influenza A (H1N1) at a New York City School." _The New
England Journal of Medicine_. doi:10.1056/NEJMoa0906089
<https://doi.org/10.1056/NEJMoa0906089>.
Distribution: weibull
Parameters:
shape: 2.360
scale: 3.180
Currently, in the library of epidemiological parameters, we have one generation time entry, for Influenza. Given the considerations above, we can look at the serial intervals for COVID-19 instead. Run this locally!
R
epiparameter::epidist_db(
disease = "COVID",
epi_dist = "serial"
)
With this query combination, we get more than one delay distribution. This output is an <epidist> class object.
CASE-INSENSITIVE
epidist_db is case-insensitive. This means that you can use strings with upper- or lower-case letters indistinctly. Strings like "serial", "serial interval" or "serial_interval" are also valid.
To summarise an <epidist> object and get the column names from the underlying parameter database, we can add the epiparameter::list_distributions() function to the previous code using the pipe %>%:
R
epiparameter::epidist_db(
disease = "covid",
epi_dist = "serial"
) %>%
epiparameter::list_distributions()
OUTPUT
Returning 4 results that match the criteria (3 are parameterised).
Use subset to filter by entry variables or single_epidist to return a single entry.
To retrieve the short citation for each use the 'get_citation' function
OUTPUT
disease epi_distribution prob_distribution author year
1 COVID-19 serial interval <NA> Muluneh .... 2021
2 COVID-19 serial interval lnorm Hiroshi .... 2020
3 COVID-19 serial interval weibull Hiroshi .... 2020
4 COVID-19 serial interval norm Lin Yang.... 2020
In the epiparameter::list_distributions() output, we can also find different types of probability distributions (e.g., Log normal, Weibull, Normal).
{epiparameter} uses the base R naming convention for distributions. This is why the Log normal is called lnorm.
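As a minimal sketch of that convention, base R prefixes a distribution's short name with d/p/q/r for density, cumulative probability, quantile, and random generation (shown here with illustrative log normal parameters):

```r
dlnorm(5, meanlog = 1.386, sdlog = 0.568)   # density at x = 5
plnorm(5, meanlog = 1.386, sdlog = 0.568)   # P(X <= 5)
qlnorm(0.5, meanlog = 1.386, sdlog = 0.568) # median = exp(1.386), ~4
```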
Entries with a missing value (<NA>) in the prob_distribution column are non-parameterised entries. They have summary statistics but no probability distribution. Compare these two outputs:
R
# get an <epidist> object
distribution <-
epiparameter::epidist_db(
disease = "covid",
epi_dist = "serial"
)
distribution %>%
# pluck the first entry in the object class <list>
pluck(1) %>%
# check if the <epidist> object has distribution parameters
is_parameterised()
# check if the second <epidist> object
# has distribution parameters
distribution %>%
pluck(2) %>%
is_parameterised()
Parameterised entries have an Inference method
As detailed in ?is_parameterised, a parameterised distribution is an entry that has a probability distribution associated with it, provided by an inference_method as shown in metadata:
R
distribution[[1]]$metadata$inference_method
distribution[[2]]$metadata$inference_method
distribution[[4]]$metadata$inference_method
Find your delay distributions!
Take 2 minutes to explore the {epiparameter}
library.
Choose a disease of interest (e.g., Influenza, Measles, etc.) and a delay distribution (e.g., the incubation period, onset to death, etc.).
Find:
- How many delay distributions are there for that disease?
- How many types of probability distribution (e.g., gamma, lognormal) are there for a given delay in that disease?
Ask:
- Do you recognise the papers?
- Should the {epiparameter} literature review consider any other paper?
The epidist_db() function with disease alone counts the number of entries like:
- studies, and
- delay distributions.
The epidist_db() function with disease and epi_dist gets a list of all entries with:
- the complete citation,
- the type of probability distribution, and
- distribution parameter values.
The combination of epidist_db() plus list_distributions() gets a data frame of all entries with columns like:
- the type of probability distribution per delay, and
- author and year of the study.
We choose to explore Ebola’s delay distributions:
R
# we expect 16 delay distributions for ebola
epiparameter::epidist_db(
disease = "ebola"
)
OUTPUT
Returning 17 results that match the criteria (17 are parameterised).
Use subset to filter by entry variables or single_epidist to return a single entry.
To retrieve the short citation for each use the 'get_citation' function
OUTPUT
List of <epidist> objects
Number of entries in library: 17
Number of studies in library: 4
Number of diseases: 1
Number of delay distributions: 16
Number of offspring distributions: 1
Now, from the output of epiparameter::epidist_db(), what is an offspring distribution?
We choose to find Ebola's incubation periods. This output lists all the papers and parameters found. Run this locally if needed:
R
epiparameter::epidist_db(
disease = "ebola",
epi_dist = "incubation"
)
We use list_distributions() to get a summary display of all of them:
R
# we expect 2 different types of delay distributions
# for ebola incubation period
epiparameter::epidist_db(
disease = "ebola",
epi_dist = "incubation"
) %>%
list_distributions()
OUTPUT
Returning 5 results that match the criteria (5 are parameterised).
Use subset to filter by entry variables or single_epidist to return a single entry.
To retrieve the short citation for each use the 'get_citation' function
OUTPUT
disease epi_distribution prob_distribution author year
1 Ebola Virus Disease incubation period lnorm Martin E.... 2011
2 Ebola Virus Disease incubation period gamma WHO Ebol.... 2015
3 Ebola Virus Disease incubation period gamma WHO Ebol.... 2015
4 Ebola Virus Disease incubation period gamma WHO Ebol.... 2015
5 Ebola Virus Disease incubation period gamma WHO Ebol.... 2015
We find two types of probability distributions for this query: lognormal and gamma.
How does {epiparameter} collect and review peer-reviewed literature? We invite you to read the vignette on the “Data Collation and Synthesis Protocol”!
Select a single distribution
The epiparameter::epidist_db() function works as a filtering or subset function. Let's use the author argument to filter for parameters by Hiroshi Nishiura:
R
epiparameter::epidist_db(
disease = "covid",
epi_dist = "serial",
author = "Hiroshi"
) %>%
epiparameter::list_distributions()
OUTPUT
Returning 2 results that match the criteria (2 are parameterised).
Use subset to filter by entry variables or single_epidist to return a single entry.
To retrieve the short citation for each use the 'get_citation' function
OUTPUT
disease epi_distribution prob_distribution author year
1 COVID-19 serial interval lnorm Hiroshi .... 2020
2 COVID-19 serial interval weibull Hiroshi .... 2020
We still get more than one epidemiological parameter. We can set the single_epidist argument to TRUE to return only one:
R
epiparameter::epidist_db(
disease = "covid",
epi_dist = "serial",
author = "Hiroshi",
single_epidist = TRUE
)
OUTPUT
Using Nishiura H, Linton N, Akhmetzhanov A (2020). "Serial interval of novel
coronavirus (COVID-19) infections." _International Journal of
Infectious Diseases_. doi:10.1016/j.ijid.2020.02.060
<https://doi.org/10.1016/j.ijid.2020.02.060>..
To retrieve the short citation use the 'get_citation' function
OUTPUT
Disease: COVID-19
Pathogen: SARS-CoV-2
Epi Distribution: serial interval
Study: Nishiura H, Linton N, Akhmetzhanov A (2020). "Serial interval of novel
coronavirus (COVID-19) infections." _International Journal of
Infectious Diseases_. doi:10.1016/j.ijid.2020.02.060
<https://doi.org/10.1016/j.ijid.2020.02.060>.
Distribution: lnorm
Parameters:
meanlog: 1.386
sdlog: 0.568
How does single_epidist work?
Looking at the help documentation for ?epiparameter::epidist_db():
- If multiple entries match the arguments supplied and single_epidist = TRUE, then the parameterised <epidist> with the largest sample size will be returned.
- If multiple entries are equal after this sorting, the first entry will be returned.
What is a parameterised <epidist>? Look at ?is_parameterised.
Let's assign this <epidist> class object to the covid_serialint object.
R
covid_serialint <-
epiparameter::epidist_db(
disease = "covid",
epi_dist = "serial",
author = "Nishiura",
single_epidist = TRUE
)
You can use plot() on <epidist> objects to visualise:
- the Probability Density Function (PDF) and
- the Cumulative Distribution Function (CDF).
R
# plot <epidist> object
plot(covid_serialint)
With the day_range argument, you can change the length or number of days on the x axis. Explore what this looks like:
R
# plot <epidist> object
plot(covid_serialint, day_range = 0:20)
Extract the summary statistics
We can get the mean and standard deviation (sd) from this <epidist> by diving into the summary_stats object:
R
# get the mean
covid_serialint$summary_stats$mean
OUTPUT
[1] 4.7
Now we have an epidemiological parameter we can reuse! We can replace the summary statistics we plugged into the EpiNow2::dist_spec() function:
R
generation_time <-
EpiNow2::dist_spec(
mean = covid_serialint$summary_stats$mean, # replaced!
sd = covid_serialint$summary_stats$sd, # replaced!
max = 20,
distribution = "gamma"
)
In the next episode we'll learn how to use EpiNow2 to correctly specify distributions and estimate transmissibility. Then we'll see how to use distribution functions to get a maximum value (max) for EpiNow2::dist_spec() and how to use {epiparameter} in your analysis.
Log normal distributions
If you need the log normal distribution parameters instead of the summary statistics, we can use epiparameter::get_parameters():
R
covid_serialint_parameters <-
epiparameter::get_parameters(covid_serialint)
covid_serialint_parameters
OUTPUT
meanlog sdlog
1.3862617 0.5679803
This gets a vector of class <numeric>, ready to use as input for any other package!
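For example, the named vector can feed base R's log normal functions directly (a sketch using the parameter values printed above):

```r
# values copied from the get_parameters() output above
covid_serialint_parameters <- c(meanlog = 1.3862617, sdlog = 0.5679803)

# median of the serial interval distribution: exp(meanlog)
qlnorm(
  p = 0.5,
  meanlog = covid_serialint_parameters[["meanlog"]],
  sdlog = covid_serialint_parameters[["sdlog"]]
) # ~4 days
```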
Challenges
R
# ebola serial interval
ebola_serial <-
epiparameter::epidist_db(
disease = "ebola",
epi_dist = "serial",
single_epidist = TRUE
)
OUTPUT
Using WHO Ebola Response Team, Agua-Agum J, Ariyarajah A, Aylward B, Blake I,
Brennan R, Cori A, Donnelly C, Dorigatti I, Dye C, Eckmanns T, Ferguson
N, Formenty P, Fraser C, Garcia E, Garske T, Hinsley W, Holmes D,
Hugonnet S, Iyengar S, Jombart T, Krishnan R, Meijers S, Mills H,
Mohamed Y, Nedjati-Gilani G, Newton E, Nouvellet P, Pelletier L,
Perkins D, Riley S, Sagrado M, Schnitzler J, Schumacher D, Shah A, Van
Kerkhove M, Varsaneux O, Kannangarage N (2015). "West African Ebola
Epidemic after One Year — Slowing but Not Yet under Control." _The New
England Journal of Medicine_. doi:10.1056/NEJMc1414992
<https://doi.org/10.1056/NEJMc1414992>..
To retrieve the short citation use the 'get_citation' function
R
ebola_serial
OUTPUT
Disease: Ebola Virus Disease
Pathogen: Ebola Virus
Epi Distribution: serial interval
Study: WHO Ebola Response Team, Agua-Agum J, Ariyarajah A, Aylward B, Blake I,
Brennan R, Cori A, Donnelly C, Dorigatti I, Dye C, Eckmanns T, Ferguson
N, Formenty P, Fraser C, Garcia E, Garske T, Hinsley W, Holmes D,
Hugonnet S, Iyengar S, Jombart T, Krishnan R, Meijers S, Mills H,
Mohamed Y, Nedjati-Gilani G, Newton E, Nouvellet P, Pelletier L,
Perkins D, Riley S, Sagrado M, Schnitzler J, Schumacher D, Shah A, Van
Kerkhove M, Varsaneux O, Kannangarage N (2015). "West African Ebola
Epidemic after One Year — Slowing but Not Yet under Control." _The New
England Journal of Medicine_. doi:10.1056/NEJMc1414992
<https://doi.org/10.1056/NEJMc1414992>.
Distribution: gamma
Parameters:
shape: 2.188
scale: 6.490
R
# get the sd
ebola_serial$summary_stats$sd
OUTPUT
[1] 9.6
R
# get the sample_size
ebola_serial$metadata$sample_size
OUTPUT
[1] 305
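As a base R sanity check, the gamma parameters printed above (shape 2.188, scale 6.490) are consistent with these summary statistics:

```r
# gamma: mean = shape * scale, sd = sqrt(shape) * scale
shape <- 2.188
scale <- 6.490
shape * scale       # mean, ~14.2 days
sqrt(shape) * scale # sd, ~9.6 days
```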
Try to visualise this distribution using plot(). Also, explore all the other nested elements within the <epidist> object.
Share about:
- What elements do you find useful for your analysis?
- What other elements would you like to see in this object? How?
Ebola’s severity parameter
A severity parameter like the duration of hospitalisation could add to the information needed about the bed capacity in response to an outbreak (Cori et al., 2017).
For Ebola:
- What is the reported point estimate of the mean duration of health care and case isolation?
An informative delay should measure the time from symptom onset to recovery or death.
Find a way to access the whole {epiparameter} database and find how that delay may be stored. The list_distributions() output is a data frame.
R
# one way to get the list of all the available parameters
epidist_db(disease = "all") %>%
list_distributions() %>%
as_tibble() %>%
distinct(epi_distribution)
OUTPUT
Returning 122 results that match the criteria (99 are parameterised).
Use subset to filter by entry variables or single_epidist to return a single entry.
To retrieve the short citation for each use the 'get_citation' function
OUTPUT
# A tibble: 12 × 1
epi_distribution
<chr>
1 incubation period
2 serial interval
3 generation time
4 onset to death
5 offspring distribution
6 hospitalisation to death
7 hospitalisation to discharge
8 notification to death
9 notification to discharge
10 onset to discharge
11 onset to hospitalisation
12 onset to ventilation
R
ebola_severity <- epidist_db(
disease = "ebola",
epi_dist = "onset to discharge"
)
OUTPUT
Returning 1 results that match the criteria (1 are parameterised).
Use subset to filter by entry variables or single_epidist to return a single entry.
To retrieve the short citation for each use the 'get_citation' function
R
# point estimate
ebola_severity$summary_stats$mean
OUTPUT
[1] 15.1
Note that for some {epiparameter} entries you will also have the uncertainty around the point estimate of each summary statistic:
R
# 95% confidence intervals
ebola_severity$summary_stats$mean_ci
OUTPUT
[1] 95
R
# limits of the confidence intervals
ebola_severity$summary_stats$mean_ci_limits
OUTPUT
[1] 14.6 15.6
The distribution Zoo
Explore this shiny app called The Distribution Zoo!
Follow these steps to reproduce the form of the COVID serial interval distribution from {epiparameter} (the covid_serialint object):
- Access the https://ben18785.shinyapps.io/distribution-zoo/ shiny app website,
- Go to the left panel,
- Keep the Category of distribution: Continuous Univariate,
- Select a new Type of distribution: Log-Normal,
- Move the sliders (the graphical control elements that allow you to adjust a value by moving a handle along a horizontal track or bar) to the covid_serialint parameters.
Replicate these with the distribution object and all its list elements: [[2]], [[3]], and [[4]]. Explore how the shape of a distribution changes when its parameters change.
Share about:
- What other features of the website do you find helpful?
Key Points
- Use {epiparameter} to access the literature catalogue of epidemiological delay distributions.
- Use epidist_db() to select single delay distributions.
- Use list_distributions() for an overview of multiple delay distributions.
- Reuse known estimates for an unknown disease in the early stage of an outbreak, when no contact tracing data are available.
Content from Quantifying transmission
Last updated on 2024-05-04 | Edit this page
Overview
Questions
- How can I estimate the time-varying reproduction number (\(R_t\)) and growth rate from a time series of case data?
- How can I quantify geographical heterogeneity from these transmission metrics?
Objectives
- Learn how to estimate transmission metrics from a time series of case data using the R package EpiNow2.
Reminder: the Effective Reproduction Number, \(R_t\)
The basic reproduction number, \(R_0\), is the average number of cases caused by one infectious individual in an entirely susceptible population.
But in an ongoing outbreak, the population does not remain entirely susceptible as those that recover from infection are typically immune. Moreover, there can be changes in behaviour or other factors that affect transmission. When we are interested in monitoring changes in transmission we are therefore more interested in the value of the effective reproduction number, \(R_t\), the average number of cases caused by one infectious individual in the population at time \(t\).
Introduction
The transmission intensity of an outbreak is quantified using two key metrics: the reproduction number, which informs on the strength of the transmission by indicating how many new cases are expected from each existing case; and the growth rate, which informs on the speed of the transmission by indicating how rapidly the outbreak is spreading or declining (doubling/halving time) within a population. For more details on the distinction between speed and strength of transmission and implications for control, review Dushoff & Park, 2021.
To estimate these key metrics using case data we must account for delays between the date of infections and date of reported cases. In an outbreak situation, data are usually available on reported dates only, therefore we must use estimation methods to account for these delays when trying to understand changes in transmission over time.
In the next tutorials we will focus on how to use the functions in EpiNow2 to estimate transmission metrics of case data. We will not cover the theoretical background of the models or inference framework, for details on these concepts see the vignette.
In this tutorial we are going to learn how to use the EpiNow2 package to estimate the time-varying reproduction number. We'll use the dplyr package to arrange some of its inputs, ggplot2 to visualise case distributions, and the pipe %>% to connect some of their functions, so let's also load the tidyverse package:
R
library(EpiNow2)
library(tidyverse)
The double-colon
The double-colon :: in R is used to access functions or objects from a specific package without loading the entire package into the current environment. This allows for a more targeted approach to using package components and helps avoid namespace conflicts.
:: lets you call a specific function from a package by explicitly mentioning the package name. For example, dplyr::filter(data, condition) uses filter() from the dplyr package without loading the entire package.
Bayesian inference
The R package EpiNow2
uses a Bayesian inference framework to
estimate reproduction numbers and infection times based on reporting
dates.
In Bayesian inference, we use prior knowledge (prior distributions) with data (in a likelihood function) to find the posterior probability.
Posterior probability \(\propto\) likelihood \(\times\) prior probability
Delay distributions and case data
Case data
To illustrate the functions of EpiNow2, we will use outbreak data from the start of the COVID-19 pandemic in the United Kingdom. The data are available in the R package incidence2.
R
dplyr::as_tibble(incidence2::covidregionaldataUK)
OUTPUT
# A tibble: 6,370 × 13
date region region_code cases_new cases_total deaths_new deaths_total
<date> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 2020-01-30 East Mi… E12000004 NA NA NA NA
2 2020-01-30 East of… E12000006 NA NA NA NA
3 2020-01-30 England E92000001 2 2 NA NA
4 2020-01-30 London E12000007 NA NA NA NA
5 2020-01-30 North E… E12000001 NA NA NA NA
6 2020-01-30 North W… E12000002 NA NA NA NA
7 2020-01-30 Norther… N92000002 NA NA NA NA
8 2020-01-30 Scotland S92000003 NA NA NA NA
9 2020-01-30 South E… E12000008 NA NA NA NA
10 2020-01-30 South W… E12000009 NA NA NA NA
# ℹ 6,360 more rows
# ℹ 6 more variables: recovered_new <dbl>, recovered_total <dbl>,
# hosp_new <dbl>, hosp_total <dbl>, tested_new <dbl>, tested_total <dbl>
To use the data, we must format it to have two columns:
- date: the date (as a date object; see ?is.Date()),
- confirm: number of confirmed cases on that date.
Let’s use dplyr for this:
R
library(dplyr)
cases <- incidence2::covidregionaldataUK %>%
select(date, cases_new) %>%
group_by(date) %>%
summarise(confirm = sum(cases_new, na.rm = TRUE)) %>%
ungroup()
We can also use the incidence2 package to:
- Aggregate cases (similar to the code above) in different time intervals (i.e., days, weeks or months) or per group category. Explore later the incidence2::incidence() reference manual.
- Complete dates for the whole range of dates per group category using incidence2::complete_dates(). Read further in its function reference manual.
R
library(tidyr)
library(dplyr)
incidence2::covidregionaldataUK %>%
# preprocess missing values
tidyr::replace_na(list(cases_new = 0)) %>%
# compute the daily incidence
incidence2::incidence(
date_index = "date",
counts = "cases_new",
count_values_to = "confirm",
date_names_to = "date"
) %>%
# complete range of dates
incidence2::complete_dates()
There are case data available for 490 days, but in an outbreak situation it is likely we would only have access to the beginning of this data set. Therefore we assume we only have the first 90 days of this data.
Delay distributions
We assume there are delays from the time of infection until the time a case is reported. We specify these delays as distributions to account for the uncertainty in individual level differences. The delay can consist of multiple types of delays/processes. A typical delay from time of infection to case reporting may consist of:
time from infection to symptom onset (the incubation period) + time from symptom onset to case notification (the reporting time).
The delay distribution for each of these processes can either be estimated from data or obtained from the literature. We can express uncertainty about the correct parameters of these distributions by choosing whether the distributions have fixed or variable parameters. To understand the difference between fixed and variable distributions, let's consider the incubation period.
Delays and data
The number of delays and type of delay are a flexible input that depend on the data. The examples below highlight how the delays can be specified for different data sources:
| Data source | Delay(s) |
|---|---|
| Time of symptom onset | Incubation period |
| Time of case report | Incubation period + time from symptom onset to case notification |
| Time of hospitalisation | Incubation period + time from symptom onset to hospitalisation |
Incubation period distribution
The distribution of the incubation period for many diseases can usually be obtained from the literature. The package {epiparameter} contains a library of epidemiological parameters for different diseases obtained from the literature.
We specify a (fixed) gamma distribution with mean \(\mu = 4\) and standard deviation \(\sigma = 2\) (shape = \(4\), scale = \(1\)) using the function dist_spec() as follows:
R
incubation_period_fixed <- dist_spec(
mean = 4, sd = 2,
max = 20, distribution = "gamma"
)
incubation_period_fixed
OUTPUT
Fixed distribution with PMF [0.019 0.12 0.21 0.21 0.17 0.11 0.069 0.039 0.021 0.011 0.0054 0.0026 0.0012 0.00058 0.00026 0.00012 5.3e-05 2.3e-05 1e-05 4.3e-06]
The argument max is the maximum value the distribution can take; in this example, 20 days.
Why a gamma distribution?
The incubation period has to be positive in value. Therefore we must specify a distribution in dist_spec which is for positive values only.
dist_spec() supports log normal and gamma distributions, which are distributions for positive values only.
For all types of delay, we will need to use distributions for positive values only - we don’t want to include delays of negative days in our analysis!
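A quick base R illustration of why these two families are safe choices for delays: their densities place zero mass on negative values.

```r
dgamma(-1, shape = 4, scale = 1)   # 0
dlnorm(-1, meanlog = 0, sdlog = 1) # 0
```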
Including distribution uncertainty
To specify a variable distribution, we include uncertainty around the mean \(\mu\) and standard deviation \(\sigma\) of our gamma distribution. If our incubation period distribution has a mean \(\mu\) and standard deviation \(\sigma\), then we assume the mean (\(\mu\)) follows a Normal distribution with standard deviation \(\sigma_{\mu}\):
\[\mbox{Normal}(\mu,\sigma_{\mu}^2)\]
and a standard deviation (\(\sigma\)) follows a Normal distribution with standard deviation \(\sigma_{\sigma}\):
\[\mbox{Normal}(\sigma,\sigma_{\sigma}^2).\]
We specify this using dist_spec with the additional arguments mean_sd (\(\sigma_{\mu}\)) and sd_sd (\(\sigma_{\sigma}\)).
R
incubation_period_variable <- dist_spec(
mean = 4, sd = 2,
mean_sd = 0.5, sd_sd = 0.5,
max = 20, distribution = "gamma"
)
incubation_period_variable
OUTPUT
Uncertain gamma distribution with (untruncated) mean 4 (SD 0.5) and SD 2 (SD 0.5)
Reporting delays
After the incubation period, there will be an additional delay of time from symptom onset to case notification: the reporting delay. We can specify this as a fixed or variable distribution, or estimate a distribution from data.
When specifying a distribution, it is useful to visualise the probability density to see the peak and spread of the distribution; in this case we will use a log normal distribution. We can use the functions convert_to_logmean() and convert_to_logsd() to convert the mean and standard deviation of a normal distribution to those of a log normal distribution.
If we want to assume that the mean reporting delay is 2 days (with a standard deviation of 1 day), the log normal distribution will look like:
R
log_mean <- convert_to_logmean(2, 1)
log_sd <- convert_to_logsd(2, 1)
x <- seq(from = 0, to = 10, length = 1000)
df <- data.frame(x = x, density = dlnorm(x, meanlog = log_mean, sdlog = log_sd))
ggplot(df) +
geom_line(
aes(x, density)
) +
theme_grey(
base_size = 15
)
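For reference, this kind of conversion is typically done by moment matching. A base R sketch of that calculation (using hypothetical helper names to_logmean()/to_logsd(), not the EpiNow2 functions themselves):

```r
# moment-matching a log normal to a target mean and sd
to_logmean <- function(mean, sd) log(mean^2 / sqrt(sd^2 + mean^2))
to_logsd <- function(mean, sd) sqrt(log(1 + sd^2 / mean^2))

log_mean <- to_logmean(2, 1)
log_sd <- to_logsd(2, 1)

# check: the implied log normal recovers the target mean and sd
exp(log_mean + log_sd^2 / 2)                             # 2
sqrt((exp(log_sd^2) - 1) * exp(2 * log_mean + log_sd^2)) # 1
```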
Using the mean and standard deviation for the log normal distribution, we can specify a fixed or variable distribution using dist_spec() as before:
R
reporting_delay_variable <- dist_spec(
mean = log_mean, sd = log_sd,
mean_sd = 0.5, sd_sd = 0.5,
max = 10, distribution = "lognormal"
)
If data are available on the time between symptom onset and reporting, we can use the function estimate_delay() to estimate a log normal distribution from a vector of delays. The code below illustrates how to use estimate_delay() with synthetic delay data.
R
delay_data <- rlnorm(500, log(5), 1) # synthetic delay data
reporting_delay <- estimate_delay(
delay_data,
samples = 1000,
bootstraps = 10
)
Generation time
We also must specify a distribution for the generation time. Here we will use a log normal distribution with mean 3.6 and standard deviation 3.1 (Ganyani et al. 2020).
R
generation_time_variable <- dist_spec(
mean = 3.6, sd = 3.1,
mean_sd = 0.5, sd_sd = 0.5,
max = 20, distribution = "lognormal"
)
Finding estimates
The function epinow() is a wrapper for the function estimate_infections() used to estimate cases by date of infection. The generation time distribution and delay distributions must be passed using the functions generation_time_opts() and delay_opts() respectively.
There are numerous other inputs that can be passed to
epinow()
, see ?EpiNow2::epinow
for more
detail. One optional input is to specify a log normal prior for the
effective reproduction number \(R_t\)
at the start of the outbreak. We specify a mean and standard deviation
as arguments of prior
within rt_opts()
:
R
rt_log_mean <- convert_to_logmean(2, 1)
rt_log_sd <- convert_to_logsd(2, 1)
rt <- rt_opts(prior = list(mean = rt_log_mean, sd = rt_log_sd))
Bayesian inference using Stan
Bayesian inference is performed using MCMC methods with the
program Stan. There are a number of
default inputs to the Stan functions, including the number of chains and
the number of samples per chain (see
?EpiNow2::stan_opts()
).
To reduce computation time, we can run chains in parallel. To do
this, we must set the number of cores to be used. By default, 4 MCMC
chains are run (see stan_opts()$chains
), so we can set an
equal number of cores to be used in parallel as follows:
R
withr::local_options(list(mc.cores = 4))
To find the maximum number of available cores on your machine, use
parallel::detectCores()
.
Checklist
Note: In the code below, _fixed
distributions are used instead of _variable
ones (delay distributions with uncertainty). This speeds up computation,
but it is generally recommended to use variable distributions that
account for the additional uncertainty.
R
# fixed alternatives
generation_time_fixed <- dist_spec(
mean = 3.6, sd = 3.1,
max = 20, distribution = "lognormal"
)
reporting_delay_fixed <- dist_spec(
mean = log_mean, sd = log_sd,
max = 10, distribution = "lognormal"
)
Now you are ready to run EpiNow2::epinow()
to estimate
the time-varying reproduction number:
R
reported_cases <- cases[1:90, ]
estimates <- epinow(
# cases
reported_cases = reported_cases,
# delays
generation_time = generation_time_opts(generation_time_fixed),
delays = delay_opts(incubation_period_fixed + reporting_delay_fixed),
# prior
rt = rt_opts(prior = list(mean = rt_log_mean, sd = rt_log_sd)),
# computation (optional)
stan = stan_opts(samples = 1000, chains = 3)
)
OUTPUT
WARN [2024-05-04 08:57:44] epinow: There were 4 divergent transitions after warmup. See
https://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
to find out why this is a problem and how to eliminate them. -
WARN [2024-05-04 08:57:44] epinow: Examine the pairs() plot to diagnose sampling problems
-
WARN [2024-05-04 08:57:45] epinow: Bulk Effective Samples Size (ESS) is too low, indicating posterior means and medians may be unreliable.
Running the chains for more iterations may help. See
https://mc-stan.org/misc/warnings.html#bulk-ess -
Do not wait for this to continue
Using stan = stan_opts()
is optional. For the purpose of
reducing computation time in this tutorial, we specified a fixed number
of samples = 1000
and chains = 3
in the
stan
argument using the stan_opts()
function.
We expect this to take approximately 3 minutes.
Remember: Using an appropriate number of samples and chains is crucial for ensuring convergence and obtaining reliable estimates in Bayesian computations using Stan. Inadequate sampling or insufficient chains may lead to issues such as divergent transitions, impacting the accuracy and stability of the inference process.
Results
We can extract and visualise estimates of the effective reproduction number through time:
R
estimates$plots$R
The uncertainty in the estimates increases through time because estimates are informed by data from the past, within the delay periods. This difference in uncertainty is categorised into three parts: the Estimate (green) uses all data; the Estimate based on partial data (orange) is based on less data (because infections that happened at the time are more likely not to have been observed yet), and therefore has increasingly wider intervals towards the date of the last data point; and the Forecast (purple) is a projection ahead of time.
We can also visualise the growth rate estimate through time:
R
estimates$plots$growth_rate
To extract a summary of the key transmission metrics at the latest date in the data:
R
summary(estimates)
OUTPUT
measure estimate
<char> <char>
1: New confirmed cases by infection date 7156 (4069 -- 12721)
2: Expected change in daily cases Likely decreasing
3: Effective reproduction no. 0.89 (0.58 -- 1.3)
4: Rate of growth -0.015 (-0.062 -- 0.037)
5: Doubling/halving time (days) -45 (19 -- -11)
As these estimates are based on partial data, they have a wide uncertainty interval.
From the summary of our analysis we see that the expected change in daily cases is Likely decreasing, with an estimated 7156 (4069 – 12721) new confirmed cases.
The effective reproduction number \(R_t\) estimate (on the last date of the data) is 0.89 (0.58 – 1.3).
The exponential growth rate of case numbers is -0.015 (-0.062 – 0.037).
The doubling/halving time (the time taken for case numbers to double or halve) is -45 (19 – -11) days; negative values indicate halving.
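The doubling/halving time follows directly from the exponential growth rate as \(\ln(2)/r\); a quick check against the point estimates above (small differences come from rounding of the reported growth rate):
R
# doubling/halving time (days) from the exponential growth rate r
growth_rate <- -0.015
log(2) / growth_rate # approximately -46 days, close to the reported -45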
Expected change in daily cases
A factor describing expected change in daily cases based on the posterior probability that \(R_t < 1\).
| Probability (\(p\)) | Expected change |
|---|---|
| \(p < 0.05\) | Increasing |
| \(0.05 \leq p < 0.4\) | Likely increasing |
| \(0.4 \leq p < 0.6\) | Stable |
| \(0.6 \leq p < 0.95\) | Likely decreasing |
| \(0.95 \leq p \leq 1\) | Decreasing |
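The thresholds above can be sketched as a small helper function (illustrative only, not part of EpiNow2):
R
# map the posterior probability that Rt < 1 to the expected-change label
expected_change <- function(p) {
  cut(
    p,
    breaks = c(0, 0.05, 0.4, 0.6, 0.95, 1),
    labels = c(
      "Increasing", "Likely increasing", "Stable",
      "Likely decreasing", "Decreasing"
    ),
    right = FALSE, include.lowest = TRUE
  )
}

expected_change(c(0.03, 0.5, 0.97))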
Quantify geographical heterogeneity
The outbreak data from the start of the COVID-19 pandemic in the United Kingdom, available in the R package incidence2, includes the region in which the cases were recorded. To find regional estimates of the effective reproduction number and cases, we must format the data to have three columns:
- date: the date,
- region: the region,
- confirm: number of confirmed cases for a region on a given date.
R
regional_cases <-
incidence2::covidregionaldataUK[, c("date", "cases_new", "region")]
colnames(regional_cases) <- c("date", "confirm", "region")
# extract the first 90 dates for all regions
dates <- sort(unique(regional_cases$date))[1:90]
regional_cases <- regional_cases[which(regional_cases$date %in% dates), ]
head(regional_cases)
OUTPUT
date confirm region
1 2020-01-30 NA East Midlands
2 2020-01-30 NA East of England
3 2020-01-30 2 England
4 2020-01-30 NA London
5 2020-01-30 NA North East
6 2020-01-30 NA North West
To find regional estimates, we use the same inputs as
epinow()
to the function
regional_epinow()
:
R
estimates_regional <- regional_epinow(
# cases
reported_cases = regional_cases,
# delays
generation_time = generation_time_opts(generation_time_fixed),
delays = delay_opts(incubation_period_fixed + reporting_delay_fixed),
# prior
rt = rt_opts(prior = list(mean = rt_log_mean, sd = rt_log_sd))
)
OUTPUT
INFO [2024-05-04 08:57:48] Producing following optional outputs: regions, summary, samples, plots, latest
INFO [2024-05-04 08:57:48] Reporting estimates using data up to: 2020-04-28
INFO [2024-05-04 08:57:48] No target directory specified so returning output
INFO [2024-05-04 08:57:48] Producing estimates for: East Midlands, East of England, England, London, North East, North West, Northern Ireland, Scotland, South East, South West, Wales, West Midlands, Yorkshire and The Humber
INFO [2024-05-04 08:57:48] Regions excluded: none
INFO [2024-05-04 09:42:33] Completed regional estimates
INFO [2024-05-04 09:42:33] Regions with estimates: 13
INFO [2024-05-04 09:42:33] Regions with runtime errors: 0
INFO [2024-05-04 09:42:33] Producing summary
INFO [2024-05-04 09:42:33] No summary directory specified so returning summary output
INFO [2024-05-04 09:42:34] No target directory specified so returning timings
R
estimates_regional$summary$summarised_results$table
OUTPUT
Region New confirmed cases by infection date
<char> <char>
1: East Midlands 345 (212 -- 547)
2: East of England 541 (336 -- 855)
3: England 3496 (2207 -- 5608)
4: London 298 (183 -- 460)
5: North East 252 (149 -- 426)
6: North West 559 (330 -- 879)
7: Northern Ireland 44 (23 -- 83)
8: Scotland 289 (163 -- 517)
9: South East 591 (350 -- 983)
10: South West 417 (295 -- 605)
11: Wales 95 (65 -- 138)
12: West Midlands 271 (143 -- 485)
13: Yorkshire and The Humber 484 (287 -- 777)
Expected change in daily cases Effective reproduction no.
<fctr> <char>
1: Likely increasing 1.2 (0.85 -- 1.6)
2: Likely increasing 1.2 (0.83 -- 1.6)
3: Likely decreasing 0.9 (0.64 -- 1.2)
4: Likely decreasing 0.79 (0.55 -- 1.1)
5: Likely decreasing 0.91 (0.61 -- 1.3)
6: Likely decreasing 0.87 (0.58 -- 1.2)
7: Likely decreasing 0.65 (0.39 -- 1)
8: Likely decreasing 0.92 (0.61 -- 1.4)
9: Stable 0.99 (0.67 -- 1.4)
10: Increasing 1.4 (1.1 -- 1.8)
11: Decreasing 0.57 (0.42 -- 0.76)
12: Likely decreasing 0.71 (0.42 -- 1.1)
13: Stable 1 (0.7 -- 1.4)
Rate of growth Doubling/halving time (days)
<char> <char>
1: 0.024 (-0.02 -- 0.067) 29 (10 -- -34)
2: 0.022 (-0.023 -- 0.066) 31 (10 -- -30)
3: -0.013 (-0.053 -- 0.03) -53 (23 -- -13)
4: -0.029 (-0.069 -- 0.012) -24 (57 -- -10)
5: -0.013 (-0.058 -- 0.036) -55 (19 -- -12)
6: -0.017 (-0.062 -- 0.023) -40 (30 -- -11)
7: -0.051 (-0.1 -- 0.0056) -13 (120 -- -7)
8: -0.011 (-0.058 -- 0.043) -63 (16 -- -12)
9: -0.0011 (-0.048 -- 0.05) -610 (14 -- -15)
10: 0.047 (0.013 -- 0.085) 15 (8.1 -- 54)
11: -0.065 (-0.092 -- -0.033) -11 (-21 -- -7.5)
12: -0.041 (-0.093 -- 0.013) -17 (52 -- -7.5)
13: 0.0037 (-0.043 -- 0.048) 190 (14 -- -16)
R
estimates_regional$summary$plots$R
the i2extras package
The i2extras package also estimates transmission metrics like the growth rate and doubling/halving time at different time intervals (i.e., days, weeks, or months). i2extras requires incidence2 objects as inputs. Read further in its Fitting curves vignette.
Summary
EpiNow2
can be used to estimate transmission metrics
from case data at any time in the course of an outbreak. The reliability
of these estimates depends on the quality of the data and appropriate
choice of delay distributions. In the next tutorial we will learn how to
make forecasts and investigate some of the additional inference options
available in EpiNow2
.
Content from Use delay distributions in analysis
Last updated on 2024-05-04 | Edit this page
Overview
Questions
- How to reuse delays stored in the
{epiparameter}
library with my existing analysis pipeline?
Objectives
- Use distribution functions with continuous and discrete distributions stored as <epidist> objects.
- Convert a continuous distribution to a discrete one with {epiparameter}.
- Connect {epiparameter} outputs with EpiNow2 inputs.
Prerequisites
- Complete tutorial Quantifying transmission
This episode requires you to be familiar with:
Data science : Basic programming with R.
Statistics : Probability distributions.
Epidemic theory : Epidemiological parameters, time periods, Effective reproductive number.
Introduction
{epiparameter}
helps us to choose one specific
set of epidemiological parameters from the literature, instead of
copying and pasting them by hand:
R
covid_serialint <-
epiparameter::epidist_db(
disease = "covid",
epi_dist = "serial",
author = "Nishiura",
single_epidist = TRUE
)
Now, we have an epidemiological parameter we can use in our analysis!
In the chunk below we replace the summary
statistics inputs of EpiNow2::dist_spec():
R
generation_time <-
EpiNow2::dist_spec(
mean = covid_serialint$summary_stats$mean, # replaced!
sd = covid_serialint$summary_stats$sd, # replaced!
max = 20,
distribution = "gamma"
)
In this episode, we will use the distribution
functions that {epiparameter}
provides to get a
maximum value (max
) for this and any other package
downstream in your analysis pipeline!
Let’s load the {epiparameter}
and EpiNow2
packages. For EpiNow2, we’ll set 4 cores to be used in
parallel computations. We’ll use the pipe %>%
, some
dplyr verbs and ggplot2, so let’s also
call the tidyverse package:
R
library(epiparameter)
library(EpiNow2)
library(tidyverse)
withr::local_options(list(mc.cores = 4))
The double-colon
The double-colon ::
in R is used to access functions or
objects from a specific package without loading the entire package into
the current environment. This allows for a more targeted approach to
using package components and helps avoid namespace conflicts.
::
lets you call a specific function from a package by
explicitly mentioning the package name. For example,
dplyr::filter(data, condition)
uses filter()
from the dplyr package without loading the entire
package.
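For example, assuming the dplyr package is installed (but not attached), the following works directly:
R
# call filter() from dplyr without loading the whole package
dplyr::filter(mtcars, mpg > 30)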
Distribution functions
In R, all the statistical distributions have functions to access the following:
- density(): Probability Density Function (PDF),
- cdf(): Cumulative Distribution Function (CDF),
- quantile(): quantile function, and
- generate(): random values from the given distribution.
Functions for the Normal distribution
If you need it, read in detail about the R probability functions for the normal distribution, each of its definitions and identify in which part of a distribution they are located!
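As a reminder, base R exposes these four function types for the normal distribution under the d-, p-, q-, and r- prefixes:
R
# base-R analogues of density(), cdf(), quantile() and generate()
dnorm(x = 0, mean = 0, sd = 1)     # density at x = 0
pnorm(q = 1.96, mean = 0, sd = 1)  # cumulative probability up to q = 1.96
qnorm(p = 0.975, mean = 0, sd = 1) # quantile at cumulative probability 0.975
rnorm(n = 5, mean = 0, sd = 1)     # five random draws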
If you look at ?stats::Distributions
, each type of
distribution has a unique set of functions. However,
{epiparameter}
gives you the same four functions to access
each of the values above for any <epidist>
object you
want!
R
# plot this to have a visual reference
plot(covid_serialint, day_range = 0:20)
R
# the density value at quantile value of 10 (days)
density(covid_serialint, at = 10)
OUTPUT
[1] 0.01911607
R
# the cumulative probability at quantile value of 10 (days)
cdf(covid_serialint, q = 10)
OUTPUT
[1] 0.9466605
R
# the quantile value (day) at a cumulative probability of 60%
quantile(covid_serialint, p = 0.6)
OUTPUT
[1] 4.618906
R
# generate 10 random values (days) given
# the distribution family and its parameters
generate(covid_serialint, times = 10)
OUTPUT
[1] 3.698099 14.965865 2.538199 3.947991 6.919752 8.699023 2.928458
[8] 19.117628 4.032281 7.400597
Window for contact tracing and the Serial interval
The serial interval is important in the optimisation of contact tracing since it provides a time window for the containment of disease spread (Fine, 2003). Depending on the serial interval, we can evaluate the need to expand the number of days pre-onset considered in the contact tracing to include more backward contacts (Davis et al., 2020).
With the COVID-19 serial interval (covid_serialint
)
calculate:
- How much more of the backward cases could be captured if the contact tracing method considered contacts up to 6 days pre-onset compared to 2 days pre-onset?
In Figure 5 from the R
probability functions for the normal distribution, the shadowed
section represents a cumulative probability of 0.997
for
the quantile value at x = 2
.
R
plot(covid_serialint)
R
cdf(covid_serialint, q = 2)
OUTPUT
[1] 0.1111729
R
cdf(covid_serialint, q = 6)
OUTPUT
[1] 0.7623645
Given the COVID-19 serial interval:
A contact tracing method considering contacts up to 2 days pre-onset will capture around 11.1% of backward cases.
If this period is extended to 6 days pre-onset, this could include 76.2% of backward contacts.
If we invert the question, exchanging days for cumulative probability:
- When considering secondary cases, how many days following the symptom onset of primary cases can we expect 55% of symptom onset to occur?
R
quantile(covid_serialint, p = 0.55)
An interpretation could be:
- 55% of secondary cases will develop symptoms within 4.2 days after the symptom onset of the primary cases.
Discretise a continuous distribution
We are getting closer to the end! EpiNow2::dist_spec()
still needs a maximum value (max
).
One way to do this is to get the quantile value for the
distribution’s 99.9th percentile or 0.999
cumulative
probability. For this, we need access to the set of distribution
functions for our <epidist>
object.
We can use the set of distribution functions for a
continuous distribution (as above). However, these values will
be continuous numbers. We can discretise the
continuous distribution stored in our <epidist>
object to get discrete values instead.
When we epiparameter::discretise()
the continuous
distribution we get a discrete(-ized) distribution:
R
covid_serialint_discrete <-
epiparameter::discretise(covid_serialint)
covid_serialint_discrete
OUTPUT
Disease: COVID-19
Pathogen: SARS-CoV-2
Epi Distribution: serial interval
Study: Nishiura H, Linton N, Akhmetzhanov A (2020). "Serial interval of novel
coronavirus (COVID-19) infections." _International Journal of
Infectious Diseases_. doi:10.1016/j.ijid.2020.02.060
<https://doi.org/10.1016/j.ijid.2020.02.060>.
Distribution: discrete lnorm
Parameters:
meanlog: 1.386
sdlog: 0.568
We identify this change in the Distribution:
output line
of the <epidist>
object. Double check this line:
Distribution: discrete lnorm
While for a continuous distribution, we plot the Probability Density Function (PDF), for a discrete distribution, we plot the Probability Mass Function (PMF):
R
# continuous
plot(covid_serialint)
# discrete
plot(covid_serialint_discrete)
To finally get a max
value, let’s access the quantile
value at the 99.9th percentile (0.999
cumulative probability) of the
distribution using the quantile()
function, similarly to
how we access the summary_stats
values.
R
covid_serialint_discrete_max <-
quantile(covid_serialint_discrete, p = 0.999)
Length of quarantine and Incubation period
The incubation period distribution is a useful delay to assess the length of active monitoring or quarantine (Lauer et al., 2020). Similarly, delays from symptom onset to recovery (or death) will determine the required duration of health care and case isolation (Cori et al., 2017).
Calculate:
- Within what exact time frame do 99% of individuals exhibiting COVID-19 symptoms exhibit them after infection?
What delay distribution measures the time between infection and the onset of symptoms?
The probability functions for <epidist>
discrete distributions are the same as those we used for
the continuous ones!
R
# plot to have a visual reference
plot(covid_serialint_discrete, day_range = 0:20)
# density value at quantile value 10 (day)
density(covid_serialint_discrete, at = 10)
# cumulative probability at quantile value 10 (day)
cdf(covid_serialint_discrete, q = 10)
# In what quantile value (days) do we have the 60% cumulative probability?
quantile(covid_serialint_discrete, p = 0.6)
# generate random values
generate(covid_serialint_discrete, times = 10)
R
covid_incubation <-
epiparameter::epidist_db(
disease = "covid",
epi_dist = "incubation",
single_epidist = TRUE
)
OUTPUT
Using McAloon C, Collins Á, Hunt K, Barber A, Byrne A, Butler F, Casey M,
Griffin J, Lane E, McEvoy D, Wall P, Green M, O'Grady L, More S (2020).
"Incubation period of COVID-19: a rapid systematic review and
meta-analysis of observational research." _BMJ Open_.
doi:10.1136/bmjopen-2020-039652
<https://doi.org/10.1136/bmjopen-2020-039652>..
To retrieve the short citation use the 'get_citation' function
R
covid_incubation_discrete <- epiparameter::discretise(covid_incubation)
quantile(covid_incubation_discrete, p = 0.99)
OUTPUT
[1] 16
99% of those who develop COVID-19 symptoms will do so within 16 days of infection.
Now, is this result expected in epidemiological terms?
From a maximum value with quantile()
, we can create a
sequence of quantile values as a numeric vector and calculate
density()
values for each:
R
# create a discrete distribution visualisation
# from a maximum value from the distribution
quantile(covid_serialint_discrete, p = 0.999) %>%
# generate quantile values
# as a sequence for each natural number
seq(1L, to = ., by = 1L) %>%
# coerce numeric vector to data frame
as_tibble_col(column_name = "quantile_values") %>%
mutate(
# calculate density values
# for each quantile in the density function
density_values =
density(
x = covid_serialint_discrete,
at = quantile_values
)
) %>%
# create plot
ggplot(
aes(
x = quantile_values,
y = density_values
)
) +
geom_col()
Remember: In infections with pre-symptomatic transmission, serial intervals can have negative values (Nishiura et al., 2020). When we use the serial interval to approximate the generation time, we need to use a distribution with positive values only!
Plug-in {epiparameter}
to {EpiNow2}
Now we can plug everything into the EpiNow2::dist_spec()
function!
- the summary statistics mean and sd of the distribution,
- a maximum value max,
- the distribution name.
However, to define a Lognormal distribution with
EpiNow2::dist_spec()
, like the one in the
covid_serialint
object, we need to convert its summary
statistics to the distribution parameters meanlog and sdlog. With
{epiparameter}
we can directly get the distribution
parameters using epiparameter::get_parameters()
:
R
covid_serialint_parameters <-
epiparameter::get_parameters(covid_serialint)
Then, we have:
R
serial_interval_covid <-
dist_spec(
mean = covid_serialint_parameters["meanlog"],
sd = covid_serialint_parameters["sdlog"],
max = covid_serialint_discrete_max,
distribution = "lognormal"
)
serial_interval_covid
OUTPUT
Fixed distribution with PMF [0.0073 0.1 0.2 0.19 0.15 0.11 0.075 0.051 0.035 0.023 0.016 0.011 0.0076 0.0053 0.0037 0.0027 0.0019 0.0014 0.001 0.00074 0.00055 0.00041 0.00031]
Assuming a COVID-19 scenario, let’s use the first 60 days of the
example_confirmed
data set from the EpiNow2
package as reported_cases
and the recently created
serial_interval_covid
object as inputs to estimate the
time-varying reproduction number using
EpiNow2::epinow()
.
R
epinow_estimates_cg <- epinow(
# cases
reported_cases = example_confirmed[1:60],
# delays
generation_time = generation_time_opts(serial_interval_covid)
)
OUTPUT
WARN [2024-05-04 09:44:07] epinow: There were 1 divergent transitions after warmup. See
https://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
to find out why this is a problem and how to eliminate them. -
WARN [2024-05-04 09:44:07] epinow: Examine the pairs() plot to diagnose sampling problems
-
R
base::plot(epinow_estimates_cg)
The plot()
output includes the estimated cases by date
of infection, which are reconstructed from the reported cases and
delays.
Warning
Using the serial interval instead of the generation time is an alternative that can propagate bias in your estimates, even more so in diseases with reported pre-symptomatic transmission. (Chung Lau et al., 2021)
Adjusting for reporting delays
Estimating \(R_t\) requires data on the daily number of new infections. Due to lags in the development of detectable viral loads, symptom onset, seeking care, and reporting, these numbers are not readily available. All observations reflect transmission events from some time in the past. In other words, if \(d\) is the delay from infection to observation, then observations at time \(t\) inform \(R_{t−d}\), not \(R_t\). (Gostic et al., 2020)
The delay distribution could be inferred jointly
with the underlying times of infection or estimated as the sum of the incubation period distribution and
the distribution of delays from symptom onset to observation from line
list data (reporting delay).
For EpiNow2, we can specify these two complementary delay
distributions in the delays
argument.
Use an Incubation period for COVID-19 to estimate Rt
Estimate the time-varying reproduction number for the first 60 days
of the example_confirmed
data set from
EpiNow2. Access to an incubation period for COVID-19 from
{epiparameter}
to use it as a reporting delay.
Use the last epinow()
calculation using the
delays
argument and the delay_opts()
helper
function.
The delays
argument and the delay_opts()
helper function are analogous to the generation_time
argument and the generation_time_opts()
helper
function.
R
epinow_estimates <- epinow(
# cases
reported_cases = example_confirmed[1:60],
# delays
generation_time = generation_time_opts(covid_serial_interval),
delays = delay_opts(covid_incubation_time)
)
R
# generation time ---------------------------------------------------------
# get covid serial interval
covid_serialint <-
epiparameter::epidist_db(
disease = "covid",
epi_dist = "serial",
author = "Nishiura",
single_epidist = TRUE
)
# adapt epidist to epinow2
covid_serialint_discrete_max <-
covid_serialint %>%
discretise() %>%
quantile(p = 0.999)
covid_serialint_parameters <-
epiparameter::get_parameters(covid_serialint)
covid_serial_interval <-
dist_spec(
mean = covid_serialint_parameters["meanlog"],
sd = covid_serialint_parameters["sdlog"],
max = covid_serialint_discrete_max,
distribution = "lognormal"
)
# incubation time ---------------------------------------------------------
# get covid incubation period
covid_incubation <- epiparameter::epidist_db(
disease = "covid",
epi_dist = "incubation",
author = "Natalie",
single_epidist = TRUE
)
# adapt epiparameter to epinow2
covid_incubation_discrete_max <-
covid_incubation %>%
discretise() %>%
quantile(p = 0.999)
covid_incubation_parameters <-
epiparameter::get_parameters(covid_incubation)
covid_incubation_time <-
dist_spec(
mean = covid_incubation_parameters["meanlog"],
sd = covid_incubation_parameters["sdlog"],
max = covid_incubation_discrete_max,
distribution = "lognormal" # do not forget this!
)
# epinow ------------------------------------------------------------------
# run epinow
epinow_estimates_cgi <- epinow(
# cases
reported_cases = example_confirmed[1:60],
# delays
generation_time = generation_time_opts(covid_serial_interval),
delays = delay_opts(covid_incubation_time)
)
OUTPUT
WARN [2024-05-04 09:46:00] epinow: There were 7 divergent transitions after warmup. See
https://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
to find out why this is a problem and how to eliminate them. -
WARN [2024-05-04 09:46:00] epinow: Examine the pairs() plot to diagnose sampling problems
-
R
base::plot(epinow_estimates_cgi)
Try to complement the delays
argument with a reporting
delay like the reporting_delay_fixed
object of the previous
episode.
How much has it changed?
After adding the incubation period, discuss:
- Does the trend of the model fit in the “Estimate” section change?
- Has the uncertainty changed?
- How would you explain or interpret any of these changes?
Compare all the EpiNow2 figures generated previously.
Challenges
A code completion tip
If we write square brackets []
next to the object, as in
covid_serialint_parameters[]
, we can press the
Tab key ↹ inside the brackets to use the code
completion feature.
This gives quick access to
covid_serialint_parameters["meanlog"]
and
covid_serialint_parameters["sdlog"]
.
We invite you to try this out in code chunks and the R console!
Ebola’s effective reproduction number adjusted by reporting delays
Download and read the Ebola dataset:
- Estimate the effective reproduction number using EpiNow2
- Adjust the estimate by the available reporting delays in
{epiparameter}
- Why did you choose that parameter?
To calculate the \(R_t\) using EpiNow2, we need:
- Aggregated incidence data, with confirmed cases per day,
- The generation time distribution, and
- Optionally, distributions of reporting delays when available (e.g., the incubation period).
To get the delay distributions using {epiparameter}
we can use functions like:
- epidist_db()
- list_distributions()
- discretise()
- quantile()
R
# read data
# e.g.: if path to file is data/raw-data/ebola_cases.csv then:
ebola_confirmed <-
read_csv(here::here("data", "raw-data", "ebola_cases.csv"))
# list distributions
epidist_db(disease = "ebola") %>%
list_distributions()
R
# generation time ---------------------------------------------------------
# subset one distribution for the generation time
ebola_serial <- epidist_db(
disease = "ebola",
epi_dist = "serial",
single_epidist = TRUE
)
# adapt epiparameter to epinow2
ebola_serial_discrete <- discretise(ebola_serial)
serial_interval_ebola <-
dist_spec(
mean = ebola_serial$summary_stats$mean,
sd = ebola_serial$summary_stats$sd,
max = quantile(ebola_serial_discrete, p = 0.999),
distribution = "gamma"
)
# incubation time ---------------------------------------------------------
# subset one distribution for delay of the incubation period
ebola_incubation <- epidist_db(
disease = "ebola",
epi_dist = "incubation",
single_epidist = TRUE
)
# adapt epiparameter to epinow2
ebola_incubation_discrete <- discretise(ebola_incubation)
incubation_period_ebola <-
dist_spec(
mean = ebola_incubation$summary_stats$mean,
sd = ebola_incubation$summary_stats$sd,
max = quantile(ebola_incubation_discrete, p = 0.999),
distribution = "gamma"
)
# epinow ------------------------------------------------------------------
# run epinow
epinow_estimates_egi <- epinow(
# cases
reported_cases = ebola_confirmed,
# delays
generation_time = generation_time_opts(serial_interval_ebola),
delays = delay_opts(incubation_period_ebola)
)
OUTPUT
WARN [2024-05-04 09:49:21] epinow: There were 9 divergent transitions after warmup. See
https://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
to find out why this is a problem and how to eliminate them. -
WARN [2024-05-04 09:49:21] epinow: Examine the pairs() plot to diagnose sampling problems
-
R
plot(epinow_estimates_egi)
EpiNow2::dist_spec()
also accepts Probability Mass
Functions (PMF) from any distribution family. Read the reference guide
on Specify
a distribution.
R
# What parameters are available for Influenza?
epidist_db(disease = "influenza") %>%
list_distributions() %>%
as_tibble() %>%
count(epi_distribution)
OUTPUT
# A tibble: 3 × 2
epi_distribution n
<chr> <int>
1 generation time 1
2 incubation period 15
3 serial interval 1
R
# generation time ---------------------------------------------------------
# Read the generation time
influenza_generation <-
epidist_db(
disease = "influenza",
epi_dist = "generation"
)
influenza_generation
OUTPUT
Disease: Influenza
Pathogen: Influenza-A-H1N1
Epi Distribution: generation time
Study: Lessler J, Reich N, Cummings D, New York City Department of Health and
Mental Hygiene Swine Influenza Investigation Team (2009). "Outbreak of
2009 Pandemic Influenza A (H1N1) at a New York City School." _The New
England Journal of Medicine_. doi:10.1056/NEJMoa0906089
<https://doi.org/10.1056/NEJMoa0906089>.
Distribution: weibull
Parameters:
shape: 2.360
scale: 3.180
R
# EpiNow2 currently accepts Gamma or LogNormal distributions by name;
# for other families we can pass the PMF values instead
influenza_generation_discrete <-
epiparameter::discretise(influenza_generation)
influenza_generation_max <-
quantile(influenza_generation_discrete, p = 0.999)
influenza_generation_pmf <-
density(
influenza_generation_discrete,
at = 1:influenza_generation_max
)
influenza_generation_pmf
OUTPUT
[1] 0.063123364 0.221349877 0.297212205 0.238968280 0.124851641 0.043094538
[7] 0.009799363
R
# EpiNow2::dist_spec() can also accept the PMF values
generation_time_influenza <-
dist_spec(
pmf = influenza_generation_pmf
)
# incubation period -------------------------------------------------------
# Read the incubation period
influenza_incubation <-
epidist_db(
disease = "influenza",
epi_dist = "incubation",
single_epidist = TRUE
)
# Discretize incubation period
influenza_incubation_discrete <-
epiparameter::discretise(influenza_incubation)
influenza_incubation_max <-
quantile(influenza_incubation_discrete, p = 0.999)
influenza_incubation_pmf <-
density(
influenza_incubation_discrete,
at = 1:influenza_incubation_max
)
influenza_incubation_pmf
OUTPUT
[1] 0.057491512 0.166877052 0.224430917 0.215076318 0.161045462 0.097466092
[7] 0.048419279 0.019900259 0.006795222
R
# EpiNow2::dist_spec() can also accept the PMF values
incubation_time_influenza <-
dist_spec(
pmf = influenza_incubation_pmf
)
# epinow ------------------------------------------------------------------
# Read data
influenza_cleaned <-
outbreaks::influenza_england_1978_school %>%
select(date, confirm = in_bed)
# Run epinow()
epinow_estimates_igi <- epinow(
# cases
reported_cases = influenza_cleaned,
# delays
generation_time = generation_time_opts(generation_time_influenza),
delays = delay_opts(incubation_time_influenza)
)
plot(epinow_estimates_igi)
Next steps
How to get distribution parameters from statistical distributions?
How to get the mean and standard deviation from a generation time
with only distribution parameters but no summary statistics
like mean
or sd
for
EpiNow2::dist_spec()
?
Look at the {epiparameter}
vignette on parameter
extraction and conversion and its use
cases!
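As a worked example with the influenza generation time above (a Weibull distribution with shape 2.36 and scale 3.18), the mean and standard deviation can also be recovered from the standard Weibull moment formulas (closed-form textbook expressions, not epiparameter calls):
R
# Weibull moments: mean and sd from shape and scale
shape <- 2.36
scale <- 3.18

weibull_mean <- scale * gamma(1 + 1 / shape)
weibull_sd <- scale * sqrt(gamma(1 + 2 / shape) - gamma(1 + 1 / shape)^2)

c(mean = weibull_mean, sd = weibull_sd)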
How to estimate delay distributions for Disease X?
Refer to this excellent tutorial on estimating the serial interval and incubation period of Disease X accounting for censoring using Bayesian inference with packages like rstan and coarseDataTools.
- Tutorial in English: https://rpubs.com/tracelac/diseaseX
- Tutorial en Español: https://epiverse-trace.github.io/epimodelac/EnfermedadX.html
Then, after you get your estimated values, you can
manually create your own <epidist>
class objects with
epiparameter::epidist()
! Take a look at its reference
guide on “Create an <epidist>
object”!
Key Points
- Use distribution functions with <epidist> objects to get summary statistics and informative parameters for public health interventions like the window for contact tracing and the length of quarantine.
- Use discretise() to convert continuous delay distributions to discrete ones.
- Use {epiparameter} to get the reporting delays required in transmissibility estimates.