Content from Read case data


Last updated on 2024-04-29 | Edit this page

Overview

Questions

  • Where do you usually store your outbreak data?
  • How many different data formats can you read?
  • Is it possible to import data from databases and health APIs?

Objectives

  • Explain how to import outbreak data from different sources into the R environment for analysis.

Prerequisites

This episode requires you to be familiar with:

Data science : Basic programming with R.

Introduction


The initial step in outbreak analysis involves importing the target dataset into the R environment from various sources. Outbreak data is typically stored in files of diverse formats, in relational database management systems (RDBMS), or behind the application programming interfaces (APIs) of health information systems (HIS) such as REDCap, DHIS2, etc. The latter option is particularly well suited for storing institutional health data. This episode will explain the process of reading case data from these sources.

Reading from files


Several packages are available for importing outbreak data stored in individual files into R. These include rio, readr from the tidyverse, io, ImportExport, and data.table. Together, these packages offer methods to read single or multiple files in a wide range of formats.

The example below shows how to import a csv file into the R environment using the rio package.

R

library("rio")
library("here")

# read data
# e.g.: if path to file is data/raw-data/ebola_cases.csv then:
ebola_confirmed <- rio::import(here::here("data", "raw-data", "ebola_cases.csv"))

# preview data
head(ebola_confirmed, 5)

OUTPUT

        date confirm
1 2014-05-18       1
2 2014-05-20       2
3 2014-05-21       4
4 2014-05-22       6
5 2014-05-23       1

Similarly, you can import files of other formats such as tsv, xlsx, etc.
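As a quick, self-contained sketch (using a small made-up dataset and a temporary file rather than the lesson data), rio picks the reader and writer from the file extension, so the same import() call works across formats:

```r
library("rio")

# A small made-up dataset standing in for the Ebola case data
cases <- data.frame(
  date = c("2014-05-18", "2014-05-20"),
  confirm = c(1, 2)
)

# {rio} infers the format from the file extension:
# exporting and re-importing a tsv file needs no format argument
tsv_path <- file.path(tempdir(), "ebola_cases.tsv")
rio::export(cases, tsv_path)
cases_tsv <- rio::import(tsv_path)

head(cases_tsv)
```

Swapping the extension to .xlsx or .csv in the path above is enough to switch formats; no other change is needed.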

Reading compressed data

Take 1 minute: Is it possible to read compressed data in R?

You can check the full list of file formats supported by the rio package on the package website. Support for some additional formats requires extra dependencies, which you can install with:

R

rio::install_formats()

For example, compressed files such as zip archives can be read directly:

R

rio::import(here::here("some", "where", "downto", "path", "file_name.zip"))

Click here to download a zip file containing data for the Marburg outbreak and then import it into your working environment.

Reading from databases


The DBI package serves as a versatile interface for interacting with database management systems (DBMS) across different back-ends or servers. It offers a uniform method for accessing and retrieving data from various database systems.

The following code chunk demonstrates how to create a temporary SQLite database in memory, store the case_data as a table within it, and subsequently read from it:

R

library("DBI")
library("RSQLite")

# Create a temporary SQLite database in memory
db_con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")

# Store the 'case_data' dataframe as a table named 'cases'
# in the SQLite database
DBI::dbWriteTable(db_con, "cases", case_data)
# Read data from the 'cases' table
result <- DBI::dbReadTable(db_con, "cases")
# Close the database connection
DBI::dbDisconnect(db_con)
# View the result
base::print(utils::head(result))

OUTPUT

   date confirm
1 16208       1
2 16210       2
3 16211       4
4 16212       6
5 16213       1
6 16214       2

This code first establishes a connection to an SQLite database created in memory using dbConnect(). Then, it writes the case_data into a table named ‘cases’ within the database using the dbWriteTable() function. Subsequently, it reads the data from the ‘cases’ table using dbReadTable(). Finally, it closes the database connection with dbDisconnect(). Read this tutorial episode on SQL databases and R for more examples.
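You can also run SQL directly against the connection with DBI::dbGetQuery(), which filters rows inside the database before they reach R. The sketch below uses a small made-up table, since case_data itself comes from earlier in the lesson:

```r
library("DBI")
library("RSQLite")

# Small made-up stand-in for 'case_data'
example_cases <- data.frame(
  date = c("2014-05-18", "2014-05-20", "2014-05-21"),
  confirm = c(1, 2, 4)
)

db_con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
DBI::dbWriteTable(db_con, "cases", example_cases)

# The WHERE clause is evaluated by SQLite, not in R
high_counts <- DBI::dbGetQuery(db_con, "SELECT * FROM cases WHERE confirm > 1")

DBI::dbDisconnect(db_con)
print(high_counts)
```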

Run SQL queries in R using dbplyr

A database interface package optimizes memory usage by processing the data inside the database before extraction, reducing the memory load in R. Conversely, conducting all data manipulation outside the database (e.g., in our local RStudio session) can lead to inefficient memory usage and strained system resources.

Read the Introduction to dbplyr vignette to learn how to generate your own queries!
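As a minimal sketch of that idea (again with a small made-up in-memory table), dplyr verbs applied to a database table are translated to SQL and executed by the database, and collect() pulls only the result into R:

```r
library("DBI")
library("RSQLite")
library("dplyr")
library("dbplyr")

db_con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
DBI::dbWriteTable(db_con, "cases", data.frame(
  date = c("2014-05-18", "2014-05-20", "2014-05-21"),
  confirm = c(1, 2, 4)
))

# Build a lazy query: nothing is executed yet
query <- dplyr::tbl(db_con, "cases") |>
  dplyr::filter(confirm > 1)

# Inspect the SQL that dbplyr generates
dplyr::show_query(query)

# Execute inside the database and retrieve only the result
result <- dplyr::collect(query)

DBI::dbDisconnect(db_con)
```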

Reading from HIS APIs


Health-related data are also increasingly stored in specialized HIS APIs like Fingertips, GoData, REDCap, and DHIS2. In such cases, you can use the readepi package, which enables reading data from HIS APIs.

Key Points

  • Use {rio}, {io}, {readr} and {ImportExport} to read data from individual files.
  • Use {readepi} to read data from HIS APIs and RDBMS.

Content from Clean and validate


Last updated on 2024-04-29 | Edit this page

Overview

Questions

  • How to clean and standardize case data?
  • How to convert raw dataset into a linelist object?

Objectives

  • Explain how to clean, curate, and standardize case data using {cleanepi} package
  • Demonstrate how to convert case data to linelist data

Introduction


In the process of analyzing outbreak data, it’s essential to ensure that the dataset is clean, curated, standardized, and validated to facilitate accurate and reproducible analysis. This episode focuses on cleaning epidemic and outbreak data using the cleanepi package, and validating it using the linelist package. For demonstration purposes, we’ll work with a simulated dataset of Ebola cases.

The first step is to import the dataset following the guidelines outlined in the Read case data episode. This involves loading the dataset into our environment and viewing its structure and content.

R

# Load packages
library("rio")
library("here")

# Read data
# e.g.: if path to file is data/raw-data/simulated_ebola_2.csv then:
raw_ebola_data <- rio::import(
  here::here("data", "raw-data", "simulated_ebola_2.csv")
)

R

# Return first five rows
utils::head(raw_ebola_data, 5)

OUTPUT

  V1 case id         age gender    status date onset date sample
1  1   14905          90      1 confirmed 03/15/2015  06/04/2015
2  2   13043 twenty-five      2            Sep /11/Y  03/01/2014
3  3   14364          54      f      <NA> 09/02/2014  03/03/2015
4  4   14675      ninety   <NA>           10/19/2014  31/ 12 /14
5  5   12648          74      F           08/06/2014  10/10/2016

A quick inspection


Quick exploration and inspection of the dataset are crucial before diving into any analysis tasks. The {cleanepi} package simplifies this process with the scan_data() function. Let’s take a look at how you can use it:

R

library("cleanepi")
cleanepi::scan_data(raw_ebola_data)

OUTPUT

  Field_names  missing numeric     date character logical
1          V1 0.000000  1.0000 0.000000  0.000000       0
2     case id 0.000000  1.0000 0.000000  0.000000       0
3         age 0.064600  0.8348 0.000000  0.100600       0
4      gender 0.157867  0.0472 0.000000  0.794933       0
5      status 0.053533  0.0000 0.000000  0.946467       0
6  date onset 0.000067  0.0000 0.915733  0.084200       0
7 date sample 0.000133  0.0000 0.999867  0.000000       0

The results provide an overview of the content of every column, including column names and the proportion of values of each data type per column. You can see that the column names in the dataset are descriptive but lack consistency, as some are composed of multiple words separated by white spaces. Additionally, some columns contain more than one data type, and there are missing values in others.

Common operations


This section demonstrates how to perform some common data cleaning operations using the {cleanepi} package.

Standardizing column names

For this example dataset, standardizing column names typically involves removing spaces and connecting different words with “_”. This practice helps maintain consistency and readability in the dataset. However, the function used for standardizing column names offers more options. Type ?cleanepi::standardize_column_names for more details.

R

sim_ebola_data <- cleanepi::standardize_column_names(raw_ebola_data)
names(sim_ebola_data)

OUTPUT

[1] "v_1"         "case_id"     "age"         "gender"      "status"     
[6] "date_onset"  "date_sample"

Challenge

  • What differences can you observe in the column names?

If you want to maintain certain column names without subjecting them to the standardization process, you can utilize the keep parameter of the standardize_column_names() function. This parameter accepts a vector of column names that are intended to be kept unchanged.
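For example, on a small made-up data frame with similarly inconsistent names, keeping one column untouched looks like this (a sketch; see ?cleanepi::standardize_column_names for the full interface):

```r
library("cleanepi")

# Made-up data frame with inconsistent column names
demo_data <- data.frame(
  "V1" = 1:3,
  "case id" = c(14905, 13043, 14364),
  "date onset" = c("03/15/2015", "09/11/2014", "09/02/2014"),
  check.names = FALSE
)

# Standardize every column name except "V1"
demo_clean <- cleanepi::standardize_column_names(demo_data, keep = "V1")
names(demo_clean)
```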

Exercise: Standardize the column names of the input dataset, but keep the “V1” column as is.

Removing irregularities

Raw data may contain irregularities such as duplicated and empty rows and columns, as well as constant columns. The remove_duplicates() and remove_constant() functions from {cleanepi} remove such irregularities, as demonstrated in the code chunk below.

R

sim_ebola_data <- cleanepi::remove_constant(sim_ebola_data)
sim_ebola_data <- cleanepi::remove_duplicates(sim_ebola_data)

Note that our simulated Ebola dataset does not contain duplicated or constant rows or columns.

Replacing missing values

In addition to these irregularities, raw data can contain missing values that may be encoded by different strings, including the empty string. To ensure robust analysis, it is good practice to replace all missing values with NA across the entire dataset. Below is a code snippet demonstrating how you can achieve this in {cleanepi}:

R

sim_ebola_data <- cleanepi::replace_missing_values(sim_ebola_data)

Validating subject IDs

Each entry in the dataset represents a subject and should be distinguishable by a specific column formatted in a particular way, such as falling within a specified range, containing certain prefixes and/or suffixes, or containing a specific number of characters. The {cleanepi} package offers the check_subject_ids() function designed precisely for this task, as shown in the code chunk below. It validates whether the subject IDs are unique and meet the required criteria.

R

# remove this chunk code once {cleanepi} is updated.
# The coercion made here will be accounted for within {cleanepi}
sim_ebola_data$case_id <- as.character(sim_ebola_data$case_id)

R

sim_ebola_data <- cleanepi::check_subject_ids(sim_ebola_data,
  target_columns = "case_id",
  range = c(0, 15000)
)

OUTPUT

Found 1957 duplicated rows. Please consult the report for more details.

Note that our simulated dataset does contain duplicated subject IDs.

Standardizing dates

An epidemic dataset typically contains date columns for different events, such as the date of infection and the date of symptom onset, and these dates can come in different formats; it is good practice to standardize them. The {cleanepi} package provides functionality for converting date columns in epidemic datasets into ISO format, ensuring consistency across the different date columns. Here’s how you can use it on our simulated dataset:

R

sim_ebola_data <- cleanepi::standardize_dates(
  sim_ebola_data,
  target_columns = c(
    "date_onset",
    "date_sample"
  )
)

utils::head(sim_ebola_data)

OUTPUT

  v_1 case_id         age gender    status date_onset date_sample
1   1   14905          90      1 confirmed 2015-03-15  2015-04-06
2   2   13043 twenty-five      2      <NA>       <NA>  2014-01-03
3   3   14364          54      f      <NA> 2014-02-09  2015-03-03
4   4   14675      ninety   <NA>      <NA> 2014-10-19  2014-12-31
5   5   12648          74      F      <NA> 2014-06-08  2016-10-10
6   6   14274 seventy-six female      <NA>       <NA>  2016-01-23

This function converts the values in the target columns, or, if target_columns = NULL, automatically detects the date columns within the dataset, and converts them into the %Y-%m-%d format.

Converting to numeric values

In a raw dataset, some columns can contain a mixture of character and numeric values, and you may want to convert the character values explicitly into numeric. For example, in our simulated dataset, some entries in the age column are written in words. The convert_to_numeric() function in {cleanepi} performs such conversion, as illustrated in the code chunk below.

R

sim_ebola_data <- cleanepi::convert_to_numeric(sim_ebola_data,
  target_columns = "age"
)
utils::head(sim_ebola_data)

OUTPUT

  v_1 case_id age gender    status date_onset date_sample
1   1   14905  90      1 confirmed 2015-03-15  2015-04-06
2   2   13043  25      2      <NA>       <NA>  2014-01-03
3   3   14364  54      f      <NA> 2014-02-09  2015-03-03
4   4   14675  90   <NA>      <NA> 2014-10-19  2014-12-31
5   5   12648  74      F      <NA> 2014-06-08  2016-10-10
6   6   14274  76 female      <NA>       <NA>  2016-01-23

Multiple operations at once


Performing data cleaning operations individually can be time-consuming and error-prone. The {cleanepi} package simplifies this process by offering a convenient wrapper function called clean_data(), which allows you to perform multiple operations at once.

The clean_data() function applies a series of predefined data cleaning operations to the input dataset. Equivalently, you can chain the individual cleaning functions via the pipe operator |>, as shown in the code snippet below.

R

# remove the line below once {cleanepi} is updated
raw_ebola_data$`case id` <- as.character(raw_ebola_data$`case id`)
# PERFORM THE OPERATIONS USING THE pipe SYNTAX
cleaned_data <- raw_ebola_data |>
  cleanepi::standardize_column_names(keep = "V1", rename = NULL) |>
  cleanepi::replace_missing_values(target_columns = NULL) |>
  cleanepi::remove_constant(cutoff = 1.0) |>
  cleanepi::remove_duplicates(target_columns = NULL) |>
  cleanepi::standardize_dates(
    target_columns = c("date_onset", "date_sample"),
    error_tolerance = 0.4,
    format = NULL,
    timeframe = NULL
  ) |>
  cleanepi::check_subject_ids(
    target_columns = "case_id",
    range = c(1, 15000)
  ) |>
  cleanepi::convert_to_numeric(target_columns = "age") |>
  cleanepi::clean_using_dictionary(dictionary = test_dict)

OUTPUT

Found 1957 duplicated rows. Please consult the report for more details.

Printing the cleaning report


The {cleanepi} package generates a comprehensive report detailing the findings and actions of all data cleansing operations conducted during the analysis. This report is presented as a webpage with multiple sections. Each section corresponds to a specific data cleansing operation, and clicking on each section allows you to access the results of that particular operation. This interactive approach enables users to efficiently review and analyze the outcomes of individual cleansing steps within the broader data cleansing process.

You can view the report using the cleanepi::print_report() function.

Example of data cleaning report generated by {cleanepi}

Validating and tagging case data


In outbreak analysis, once you have completed the initial steps of reading and cleaning the case data, it’s essential to establish an additional foundational layer to ensure the integrity and reliability of subsequent analyses. Specifically, this involves verifying the presence and correct data type of certain input columns within your dataset, a process commonly referred to as “tagging.” Additionally, it’s crucial to implement measures to validate that these tagged columns are not inadvertently deleted during further data processing steps.

This is achieved by converting the cleaned case data into a linelist object using the linelist package, as shown in the code chunk below.

R

library("linelist")
data <- linelist::make_linelist(cleaned_data,
  id = "case_id",
  age = "age",
  date_onset = "date_onset",
  date_reporting = "date_sample",
  gender = "gender"
)
utils::head(data, 7)

OUTPUT


// linelist object
  V1 case_id age gender    status date_onset date_sample
1  1   14905  90   male confirmed 2015-03-15  2015-04-06
2  2   13043  25 female      <NA>       <NA>  2014-01-03
3  3   14364  54 female      <NA> 2014-02-09  2015-03-03
4  4   14675  90   <NA>      <NA> 2014-10-19  2014-12-31
5  5   12648  74 female      <NA> 2014-06-08  2016-10-10
6  6   14274  76 female      <NA>       <NA>  2016-01-23
7  7   14132  16   male confirmed       <NA>  2015-10-05

// tags: id:case_id, date_onset:date_onset, date_reporting:date_sample, gender:gender, age:age 

Key Points

  • Use the {cleanepi} package to clean and standardize epidemic and outbreak data
  • Use {linelist} to tag, validate, and prepare case data for downstream analysis.

Content from Aggregate and visualize


Last updated on 2024-04-29 | Edit this page

Overview

Questions

  • How to aggregate case data?
  • How to visualize aggregated data?
  • What is the distribution of cases in time, place, gender, and age?

Objectives

  • Simulate synthetic outbreak data
  • Convert linelist data to incidence
  • Create epidemic curves from incidence data

Introduction


In an analytic pipeline, exploratory data analysis (EDA) is an important step before formal modelling. EDA helps determine relationships between variables and summarize their main characteristics often by means of data visualization.

This episode focuses on EDA of outbreak and epidemic data, and how to achieve that using a couple of handy R packages. A key aspect of EDA in epidemic analysis is capturing the relationship between time and the number of reported cases, spanning various categories (confirmed, hospitalized, deaths, and recoveries), locations, and other demographic factors such as gender, age, etc.

Synthetic outbreak data


To illustrate the process of conducting EDA on outbreak data, we will generate a line list for a hypothetical Ebola outbreak utilizing the {simulist} package. This line list dataset offers individual-level information about the outbreak. For our simulation, we will assume that the dynamics of this outbreak are influenced by several factors: the contact distribution (average number of contacts for an infected case), the distribution of contact intervals (time period between contacts), and the delay distributions of onset to hospitalization and onset to death. These latter distributions can be sourced from the literature and are conveniently available in the {epiparameter} package, as shown in the code chunk below.

R

# Load simulist and epiparameter packages
library("simulist")
library("epiparameter")

# Define contact distribution
contact_dist <- epiparameter::epidist(
  disease = "Ebola",
  epi_dist = "contact distribution",
  prob_distribution = "pois",
  prob_distribution_params = c(mean = 2)
)

# Define distribution for the interval between contacts
cont_interval <- epiparameter::epidist(
  disease = "Ebola",
  epi_dist = "contact interval",
  prob_distribution = "gamma",
  prob_distribution_params = c(shape = 1, scale = 1)
)

# Define onset to hospitalization distribution
onset_to_hosp <- epiparameter::epidist(
  disease = "Ebola",
  epi_dist = "onset to hospitalisation",
  prob_distribution = "pois",
  prob_distribution_params = c(mean = 7)
)

# get onset to death from {epiparameter} database
onset_to_death <- epiparameter::epidist_db(
  disease = "Ebola",
  epi_dist = "onset to death",
  single_epidist = TRUE
)

# Define distribution for infectious period
infect_period <- epiparameter::epidist(
  disease = "Ebola",
  epi_dist = "Infectious period",
  prob_distribution = "gamma",
  prob_distribution_params = c(shape = 1, scale = 1)
)

Additionally, we assume that the outbreak started at the beginning of 2023, is highly contagious with a probability of infection of \(60\%\), and its minimum and maximum sizes are 1000 and 10,000, respectively. Combining these assumptions with the mentioned distributions, the code chunk below generates a simulated line list:

R

# Set seed to 1 to  have the same results
base::set.seed(1)

# Generate simulation data using the defined distribution.
linelist <- simulist::sim_linelist(
  contact_dist,
  infect_period,
  prob_infect = 0.6,
  onset_to_hosp,
  onset_to_death,
  hosp_risk = 0.2,
  hosp_death_risk = 0.5,
  non_hosp_death_risk = 0.05,
  outbreak_start_date = as.Date("2023-01-01"),
  add_names = TRUE,
  add_ct = TRUE,
  outbreak_size = c(1000, 10000),
  population_age = c(1, 90),
  case_type_probs = c(suspected = 0.2, probable = 0.1, confirmed = 0.7),
  config = simulist::create_config()
)

# View first few rows of the generated data
utils::head(linelist)

OUTPUT

  id                  case_name case_type sex age date_onset date_admission
1  1           Keegan Hardy-Roy confirmed   m   3 2023-01-01           <NA>
2  6    Alexandria Torres-Perez  probable   f  13 2023-01-01           <NA>
3  7          Mitchell Reinhart confirmed   m  74 2023-01-01     2023-01-08
4  9 Stephanie Loadman-Copeland confirmed   f  65 2023-01-01     2023-01-10
5 10           Sufyaan al-Irani confirmed   m   8 2023-01-02           <NA>
6 11               Rohan Nguyen confirmed   m  27 2023-01-01           <NA>
  date_death date_first_contact date_last_contact ct_value
1       <NA>               <NA>              <NA>     24.9
2       <NA>         2022-12-30        2023-01-02       NA
3       <NA>         2022-12-31        2023-01-04     24.9
4       <NA>         2023-01-03        2023-01-04     24.9
5       <NA>         2022-12-31        2023-01-02     24.9
6       <NA>         2023-01-01        2023-01-05     24.9

Aggregating


Downstream analysis involves working with aggregated data rather than individual cases. This requires grouping linelist data in the form of incidence data. The incidence2 package offers an essential function, incidence(), for grouping case data, usually centered around dated events and/or other factors. The code chunk below demonstrates the creation of an incidence2 object from the simulated Ebola linelist data based on the date of onset.

R

# load incidence2 package
library("incidence2")

# create incidence object by aggregating case data  based on the date of onset
dialy_incidence_data <- incidence2::incidence(
  linelist,
  date_index = "date_onset",
  interval = 1
)

# View the incidence data for the first 5 days
utils::head(dialy_incidence_data, 5)

OUTPUT

# incidence:  5 x 3
# count vars: date_onset
  date_index count_variable count
* <period>   <chr>          <int>
1 2023-01-01 date_onset       475
2 2023-01-02 date_onset      4904
3 2023-01-03 date_onset      5478
4 2023-01-04 date_onset      5319
5 2023-01-05 date_onset      3520

Furthermore, with the incidence2 package, you can specify the desired interval and categorize cases by one or more factors. Below is a code snippet demonstrating weekly case counts grouped by sex and case type.

R

# Grouping data by week
weekly_incidence_data <- incidence2::incidence(
  linelist,
  date_index = "date_onset",
  interval = 7,
  groups = c("sex", "case_type")
)

# View incidence data for the first 5 weeks
utils::head(weekly_incidence_data, 5)

OUTPUT

# incidence:  5 x 5
# count vars: date_onset
# groups:     sex, case_type
  date_index               sex   case_type count_variable count
* <period>                 <chr> <chr>     <chr>          <int>
1 2022-12-29 to 2023-01-04 f     confirmed date_onset      5576
2 2022-12-29 to 2023-01-04 f     probable  date_onset       766
3 2022-12-29 to 2023-01-04 f     suspected date_onset      1612
4 2022-12-29 to 2023-01-04 m     confirmed date_onset      5760
5 2022-12-29 to 2023-01-04 m     probable  date_onset       787

Notes

When cases are grouped by different factors, it’s possible that these groups may have different date ranges in the resulting incidence2 object. The incidence2 package provides a function called complete_dates() to ensure that an incidence object has the same range of dates for each group. By default, missing counts will be filled with 0.

R

# Create incidence object
dialy_incidence_data_2 <- incidence2::incidence(
  linelist,
  date_index = "date_onset",
  groups = "sex",
  interval = 1
)

# Complete missing dates in the incidence object
incidence2::complete_dates(dialy_incidence_data_2,
  expand = TRUE,
  fill = 0L, by = 1L,
  allow_POSIXct = FALSE
)

OUTPUT

# incidence:  24 x 4
# count vars: date_onset
# groups:     sex
   date_index sex   count_variable count
   <period>   <chr> <chr>          <int>
 1 2023-01-01 f     date_onset       242
 2 2023-01-01 m     date_onset       233
 3 2023-01-02 f     date_onset      2403
 4 2023-01-02 m     date_onset      2501
 5 2023-01-03 f     date_onset      2709
 6 2023-01-03 m     date_onset      2769
 7 2023-01-04 f     date_onset      2600
 8 2023-01-04 m     date_onset      2719
 9 2023-01-05 f     date_onset      1763
10 2023-01-05 m     date_onset      1757
# ℹ 14 more rows

Visualization


The incidence2 object can be visualized using the plot() function from base R. The resulting graph is referred to as an epidemic curve, or epi-curve for short. The following code snippets generate epi-curves for the dialy_incidence_data and weekly_incidence_data incidence objects created above.

R

# Load ggplot2 and tracetheme packages
library("ggplot2")
library("tracetheme")

# Plot daily incidence data
base::plot(dialy_incidence_data) + ggplot2::labs(
  x = "Time (in days)",
  y = "Daily cases"
) + tracetheme::theme_trace()

R

# Plot weekly incidence data

base::plot(weekly_incidence_data) + ggplot2::labs(
  x = "Time (in weeks)",
  y = "Weekly cases"
) + tracetheme::theme_trace()

Challenge 1: Can you do it?

  • Using suitable distributions for contacts, contact interval, infectious period, onset to hospitalization, and onset to death, generate simulated linelist data for a Marburg outbreak with an infection probability of \(0.5\).
  • Aggregate the generated linelist and produce some epidemic curves.

Key Points

  • Use {simulist} package to generate synthetic outbreak data
  • Use incidence2 package to aggregate case data based on a date event, and produce epidemic curves.

Content from Simple analysis


Last updated on 2024-04-29 | Edit this page

Overview

Questions

  • What is the growth rate of the epidemic?
  • How to identify the peak time of an outbreak?
  • How to compute moving average of cases?

Objectives

  • Perform an early analysis of outbreak data
  • Identify trends, exponential, doubling, and peak time

Introduction


Understanding the trend in case data is crucial for various purposes, such as forecasting future case counts, implementing public health interventions, and assessing the effectiveness of control measures. By analyzing the trend, policymakers and public health experts can make informed decisions to mitigate the spread of diseases and protect public health. This episode focuses on how to perform a simple early analysis of incidence data. It uses the same dataset of COVID-19 case data from England that was used in the Aggregate and visualize episode.

Simple model


Aggregated case data over a specific time unit, or incidence data, typically represent the number of cases occurring within that time frame. These data can often be assumed to follow either a Poisson distribution or a negative binomial (NB) distribution, depending on the specific characteristics of the data and the underlying processes generating them. When analyzing such data, one common approach is to examine the trend over time by computing the rate of change, which can indicate whether there is exponential growth or decay in the number of cases. Exponential growth implies that the number of cases is increasing at an accelerating rate over time, while exponential decay suggests that the number of cases is decreasing at a decelerating rate.

The i2extras package provides methods for modelling the trend in case data, calculating moving averages, and estimating exponential growth or decay rates. The code chunk below computes the COVID-19 trend in the UK within the first 3 months using a negative binomial distribution.

R

# load the i2extras package, which provides methods for modelling case data
library("i2extras")
# load the incidence2 package, used to compute incidence data
library("incidence2")
# subset the covid19_eng_case_data to include only the first 3 months of data
covid19_eng_case_data <- outbreaks::covid19_england_nhscalls_2020
df <- base::subset(
  covid19_eng_case_data,
  covid19_eng_case_data$date <= min(covid19_eng_case_data$date) + 90
)
# uses the incidence function from the incidence2 package to compute the
# incidence data
df_incid <- incidence2::incidence(df, date_index = "date", groups = "sex")

# fit a curve to the incidence data. The model chosen is the negative binomial
# distribution with a significance level (alpha) of 0.05.
fitted_curve_nb <- i2extras::fit_curve(df_incid, model = "negbin", alpha = 0.05)
base::plot(fitted_curve_nb, angle = 45) + ggplot2::labs(x = "Date", y = "Cases")

Challenge 1: Poisson distribution

Repeat the above analysis using the Poisson distribution.

R

fitted_curve_poisson <- i2extras::fit_curve(df_incid, model = "poisson",
                                            alpha = 0.05)
base::plot(fitted_curve_poisson, angle = 45) +
  ggplot2::labs(x = "Date", y = "Cases")

Exponential growth or decay rate


The exponential growth or decay rate, denoted as \(r\), serves as an indicator for the trend in cases, indicating whether they are increasing (growth) or decreasing (decay) on an exponential scale. This rate is computed using the so-called renewal equation (Wallinga et al. 2006), which mechanistically links the reproductive number \(R\) of new cases to the generation interval of the disease. This computational method is implemented in the i2extras package.

Below is a code snippet demonstrating how to extract the growth/decay rate from the above NB-fitted curve using the growth_rate() function:

R

rates_nb <- i2extras::growth_rate(fitted_curve_nb)
rates_nb <- base::as.data.frame(rates_nb) |>
  subset(select = c(sex, r, r_lower, r_upper))
base::print(rates_nb)

OUTPUT

      sex            r      r_lower      r_upper
1  female -0.008241228 -0.009182635 -0.007300403
2    male -0.008346783 -0.009316775 -0.007377392
3 unknown -0.023703987 -0.028179436 -0.019299926

Challenge 2: Growth rates from the Poisson-fitted curve

Extract the growth rates from the Poisson-fitted curve of Challenge 1.

Peak time


The peak time is the time at which the highest number of cases is observed in the aggregated data. It can be estimated using the i2extras::estimate_peak() function, as shown in the code chunk below, which identifies the peak time from the incidence2 object df_incid.

R

peaks_nb <- i2extras::estimate_peak(df_incid, progress = FALSE) |>
  subset(select = -c(count_variable, bootstrap_peaks))
base::print(peaks_nb)

OUTPUT

# A data frame: 3 × 6
  sex     observed_peak observed_count lower_ci   median     upper_ci  
* <chr>   <date>                 <int> <date>     <date>     <date>    
1 female  2020-03-26              1314 2020-03-18 2020-03-23 2020-03-29
2 male    2020-03-27              1299 2020-03-18 2020-03-23 2020-03-30
3 unknown 2020-04-10                32 2020-03-24 2020-04-10 2020-04-16

Moving average


A moving or rolling average calculates the average number of cases within a specified time period. This can be achieved by utilizing the add_rolling_average() function from the i2extras package on an incidence2 object. The following code chunk demonstrates the computation of the weekly average number of cases from the incidence2 object df_incid, followed by visualization.

R

library("ggplot2")
moving_Avg_week <- i2extras::add_rolling_average(df_incid, n = 7L)
base::plot(moving_Avg_week, border_colour = "white", angle = 45) +
  ggplot2::geom_line(ggplot2::aes(x = date_index, y = rolling_average,
                                  color = "red")) +
  ggplot2::labs(x = "Date", y = "Cases")

Challenge 3: Monthly moving average

Compute and visualize the monthly moving average of cases on df_incid.

R

moving_Avg_mont <- i2extras::add_rolling_average(df_incid, n = 30L)
base::plot(moving_Avg_mont, border_colour = "white", angle = 45) +
  ggplot2::geom_line(ggplot2::aes(x = date_index, y = rolling_average,
                                  color = "red")) +
  ggplot2::labs(x = "Date", y = "Cases")

Key Points

  • Use i2extras to:
    • fit an epidemic curve using either Poisson or negative binomial (NB) distributions,
    • calculate exponential growth or decline of cases,
    • find the peak time, and
    • compute the moving average of cases in a specified time window.

Content from Access epidemiological delay distributions


Last updated on 2024-05-04 | Edit this page

Overview

Questions

  • How to get access to disease delay distributions from a pre-established database for use in analysis?

Objectives

  • Get delays from a literature search database with {epiparameter}.
  • Get distribution parameters and summary statistics of delay distributions.

Prerequisites

This episode requires you to be familiar with:

Data science : Basic programming with R.

Epidemic theory : epidemiological parameters, disease time periods, such as the incubation period, generation time, and serial interval.

Introduction


Infectious diseases follow an infection cycle, which usually includes the following phases: presymptomatic period, symptomatic period and recovery period, as described by their natural history. These time periods can be used to understand transmission dynamics and inform disease prevention and control interventions.

Definition of key time periods. From Xiang et al, 2021
Definition of key time periods. From Xiang et al, 2021

Definitions

Look at the glossary for the definitions of all the time periods of the figure above!

However, early in an epidemic, modelling efforts can be delayed by the lack of a centralised resource that summarises input parameters for the disease of interest (Nash et al., 2023). Projects like {epiparameter} and {epireview} are building online catalogues following literature synthesis protocols that can help parametrise models by easily accessing a comprehensive library of previously estimated epidemiological parameters from past outbreaks.

To exemplify how to use the {epiparameter} R package in your analysis pipeline, our goal in this episode will be to choose one specific set of epidemiological parameters from the literature and plug it into an EpiNow2 analysis workflow, instead of copying and pasting values by hand.

Let’s start by loading the {epiparameter} package. We’ll use the pipe %>% to connect some of its functions with some tibble and dplyr functions, so let’s also load the tidyverse package:

R

library(epiparameter)
library(tidyverse)

The problem


If we want to estimate the transmissibility of an infection, it’s common to use a package such as EpiEstim or EpiNow2. However, both require some epidemiological information as an input. For example, in EpiNow2 we use EpiNow2::dist_spec() to specify a generation time as a probability distribution adding its mean, standard deviation (sd), and maximum value (max). To specify a generation_time that follows a Gamma distribution with mean \(\mu = 4\), standard deviation \(\sigma = 2\), and a maximum value of 20, we write:

R

generation_time <- 
  EpiNow2::dist_spec(
    mean = 4,
    sd = 2,
    max = 20,
    distribution = "gamma"
  )

It is a common practice for analysts to manually search the available literature and copy and paste the summary statistics or the distribution parameters from scientific publications. A challenge that is often faced is that the reporting of different statistical distributions is not consistent across the literature. {epiparameter}’s objective is to facilitate the access to reliable estimates of distribution parameters for a range of infectious diseases, so that they can easily be implemented in outbreak analytic pipelines.

In this episode, we will choose the summary statistics from the library of epidemiological parameters provided by {epiparameter}.

Generation time vs serial interval


The generation time, jointly with the reproduction number (\(R\)), provide valuable insights on the strength of transmission and inform the implementation of control measures. Given a \(R>1\), the shorter the generation time, the earlier the incidence of disease cases will grow.

Video from the MRC Centre for Global Infectious Disease Analysis, Ep 76. Science In Context - Epi Parameter Review Group with Dr Anne Cori (27-07-2023) at https://youtu.be/VvpYHhFDIjI?si=XiUyjmSV1gKNdrrL
Video from the MRC Centre for Global Infectious Disease Analysis, Ep 76. Science In Context - Epi Parameter Review Group with Dr Anne Cori (27-07-2023) at https://youtu.be/VvpYHhFDIjI?si=XiUyjmSV1gKNdrrL

In calculating the effective reproduction number (\(R_{t}\)), the generation time distribution is often approximated by the serial interval distribution. This frequent approximation is because it is easier to observe and measure the onset of symptoms than the onset of infectiousness.

A schematic of the relationship of different time periods of transmission between an infector and an infectee in a transmission pair. Exposure window is defined as the time interval having viral exposure, and transmission window is defined as the time interval for onward transmission with respect to the infection time (Chung Lau et al., 2021).
A schematic of the relationship of different time periods of transmission between an infector and an infectee in a transmission pair. Exposure window is defined as the time interval having viral exposure, and transmission window is defined as the time interval for onward transmission with respect to the infection time (Chung Lau et al., 2021).

However, using the serial interval as an approximation of the generation time is primarily valid for diseases in which infectiousness starts after symptom onset (Chung Lau et al., 2021). In cases where infectiousness starts before symptom onset, the serial intervals can have negative values, which is the case for diseases with pre-symptomatic transmission (Nishiura et al., 2020).

From time periods to probability distributions.

When we calculate the serial interval, we see that not all case pairs have the same time length. We will observe this variability for any case pair and individual time period, including the incubation period and infectious period.

Serial intervals of possible case pairs in (a) COVID-19 and (b) MERS-CoV. Pairs represent a presumed infector and their presumed infectee plotted by date of symptom onset (Althobaity et al., 2022).
Serial intervals of possible case pairs in (a) COVID-19 and (b) MERS-CoV. Pairs represent a presumed infector and their presumed infectee plotted by date of symptom onset (Althobaity et al., 2022).

To summarise these data from individual and pair time periods, we can find the statistical distributions that best fit the data (McFarland et al., 2023).

Fitted serial interval distribution for (a) COVID-19 and (b) MERS-CoV based on reported transmission pairs in Saudi Arabia. We fitted three commonly used distributions, Lognormal, Gamma, and Weibull distributions, respectively (Althobaity et al., 2022).
Fitted serial interval distribution for (a) COVID-19 and (b) MERS-CoV based on reported transmission pairs in Saudi Arabia. We fitted three commonly used distributions, Lognormal, Gamma, and Weibull distributions, respectively (Althobaity et al., 2022).

Statistical distributions are summarised in terms of their summary statistics like the location (mean and percentiles) and spread (variance or standard deviation) of the distribution, or with their distribution parameters that inform about the form (shape and rate/scale) of the distribution. These estimated values can be reported with their uncertainty (95% confidence intervals).

Serial interval estimates using Gamma, Weibull, and Log normal distributions. 95% confidence intervals for the shape and scale (logmean and sd for Log normal) parameters are shown in brackets (Althobaity et al., 2022).

Gamma      mean             shape           rate/scale
MERS-CoV   14.13(13.9–14.7) 6.31(4.88–8.52) 0.43(0.33–0.60)
COVID-19   5.1(5.0–5.5)     2.77(2.09–3.88) 0.53(0.38–0.76)
Weibull    mean             shape           rate/scale
MERS-CoV   14.2(13.3–15.2)  3.07(2.64–3.63) 16.1(15.0–17.1)
COVID-19   5.2(4.6–5.9)     1.74(1.46–2.11) 5.83(5.08–6.67)
Log normal mean             mean-log        sd-log
MERS-CoV   14.08(13.1–15.2) 2.58(2.50–2.68) 0.44(0.39–0.5)
COVID-19   5.2(4.2–6.5)     1.45(1.31–1.61) 0.63(0.54–0.74)
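To see how summary statistics and distribution parameters relate, here is a minimal base R sketch. For a Gamma distribution, shape = (mean/sd)² and rate = mean/sd², and conversely mean = shape/rate and sd = √shape/rate. The numbers below are illustrative, not estimates from any study:

```r
# convert Gamma summary statistics (mean, sd) into
# distribution parameters (shape, rate);
# illustrative values, not taken from the table above
gamma_params <- function(mean, sd) {
  c(shape = (mean / sd)^2, rate = mean / sd^2)
}
gamma_params(mean = 5.2, sd = 3.1)

# and the reverse: recover mean and sd from shape and rate
shape <- 2.8
rate <- 0.54
c(mean = shape / rate, sd = sqrt(shape) / rate)
```

These identities are what parameter-conversion utilities rely on when moving between reported summary statistics and fitted distribution parameters.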

Serial interval

Assume that COVID-19 and SARS have similar reproduction number values and that the serial interval approximates the generation time.

Given the Serial interval of both infections in the figure below:

  • Which one would be harder to control?
  • Why do you conclude that?
Serial interval of novel coronavirus (COVID-19) infections overlaid with a published distribution of SARS. (Nishiura et al., 2020)
Serial interval of novel coronavirus (COVID-19) infections overlaid with a published distribution of SARS. (Nishiura et al., 2020)

The peak of each curve can inform you about the location of the mean of each distribution. The larger the mean, the larger the serial interval.

Which one would be harder to control?

COVID-19

Why do you conclude that?

COVID-19 has the lowest mean serial interval. The approximate mean value for the serial interval of COVID-19 is around four days, and SARS is about seven days. Thus, COVID-19 will likely have newer generations in less time than SARS, assuming similar reproduction numbers.

Choosing epidemiological parameters


In this section, we will use {epiparameter} to obtain the generation time and the serial interval for COVID-19, so these metrics can be used to estimate the transmissibility of this disease using EpiNow2 in subsequent sections of this episode.

Let’s start by looking at how many entries are available in the epidemiological distributions database in {epiparameter} (epidist_db) for the disease named covid-19:

R

epiparameter::epidist_db(
  disease = "covid"
)

OUTPUT

Returning 27 results that match the criteria (22 are parameterised). 
Use subset to filter by entry variables or single_epidist to return a single entry. 
To retrieve the short citation for each use the 'get_citation' function

OUTPUT

List of <epidist> objects
  Number of entries in library: 27
  Number of studies in library: 10
  Number of diseases: 1
  Number of delay distributions: 27
  Number of offspring distributions: 0

The double-colon

The double-colon :: in R is used to access functions or objects from a specific package without loading the entire package into the current environment. This allows for a more targeted approach to using package components and helps avoid namespace conflicts.

:: lets you call a specific function from a package by explicitly mentioning the package name. For example, dplyr::filter(data, condition) uses filter() from the dplyr package without loading the entire package.

From the {epiparameter} package, we can use the epidist_db() function to ask for any disease and also for a specific epidemiological distribution (epi_dist).

Let’s ask now how many parameters we have in the epidemiological distributions database (epidist_db) with the generation time using the string generation:

R

epiparameter::epidist_db(
  epi_dist = "generation"
)

OUTPUT

Returning 1 results that match the criteria (1 are parameterised). 
Use subset to filter by entry variables or single_epidist to return a single entry. 
To retrieve the short citation for each use the 'get_citation' function

OUTPUT

Disease: Influenza
Pathogen: Influenza-A-H1N1
Epi Distribution: generation time
Study: Lessler J, Reich N, Cummings D, New York City Department of Health and
Mental Hygiene Swine Influenza Investigation Team (2009). "Outbreak of
2009 Pandemic Influenza A (H1N1) at a New York City School." _The New
England Journal of Medicine_. doi:10.1056/NEJMoa0906089
<https://doi.org/10.1056/NEJMoa0906089>.
Distribution: weibull
Parameters:
  shape: 2.360
  scale: 3.180

Currently, the library of epidemiological parameters has one generation time entry, for Influenza. Given the considerations above, we can look at the serial intervals for COVID-19 instead. Run this locally!

R

epiparameter::epidist_db(
  disease = "COVID",
  epi_dist = "serial"
)

With this query combination, we get more than one delay distribution. This output is an <epidist> class object.

CASE-INSENSITIVE

epidist_db is case-insensitive. This means that you can use strings with letters in upper or lower case indistinctly. Strings like "serial", "serial interval" or "serial_interval" are also valid.

To summarise an <epidist> object and get the column names from the underlying parameter database, we can add the epiparameter::list_distributions() function to the previous code using the pipe %>%:

R

epiparameter::epidist_db(
  disease = "covid",
  epi_dist = "serial"
) %>%
  epiparameter::list_distributions()

OUTPUT

Returning 4 results that match the criteria (3 are parameterised). 
Use subset to filter by entry variables or single_epidist to return a single entry. 
To retrieve the short citation for each use the 'get_citation' function

OUTPUT

   disease epi_distribution prob_distribution       author year
1 COVID-19  serial interval              <NA> Muluneh .... 2021
2 COVID-19  serial interval             lnorm Hiroshi .... 2020
3 COVID-19  serial interval           weibull Hiroshi .... 2020
4 COVID-19  serial interval              norm Lin Yang.... 2020

In the epiparameter::list_distributions() output, we can also find different types of probability distributions (e.g., Log-normal, Weibull, Normal).

{epiparameter} uses the base R naming convention for distributions. This is why Lognormal is called lnorm.
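As a quick reminder of that convention, base R provides a d/p/q/r function family for each distribution; for the log normal these are dlnorm(), plnorm(), qlnorm(), and rlnorm(). The parameter values below are illustrative:

```r
# base R calls the log normal distribution "lnorm"; the d/p/q/r
# prefixes give its density, cumulative probability, quantile,
# and random sampler (illustrative parameter values)
dlnorm(x = 4, meanlog = 1.4, sdlog = 0.6)   # density at 4 days
plnorm(q = 4, meanlog = 1.4, sdlog = 0.6)   # P(X <= 4)
qlnorm(p = 0.5, meanlog = 1.4, sdlog = 0.6) # median, equal to exp(meanlog)
```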

Entries with a missing value (<NA>) in the prob_distribution column are non-parameterised entries. They have summary statistics but no probability distribution. Compare these two outputs:

R

# get an <epidist> object
distribution <-
  epiparameter::epidist_db(
    disease = "covid",
    epi_dist = "serial"
  )

distribution %>%
  # pluck the first entry in the object class <list>
  pluck(1) %>%
  # check if the <epidist> object has distribution parameters
  is_parameterised()

# check if the second <epidist> object
# has distribution parameters
distribution %>%
  pluck(2) %>%
  is_parameterised()

Parameterised entries have an Inference method

As detailed in ?is_parameterised, a parameterised distribution is the entry that has a probability distribution associated with it provided by an inference_method as shown in metadata:

R

distribution[[1]]$metadata$inference_method
distribution[[2]]$metadata$inference_method
distribution[[4]]$metadata$inference_method

Find your delay distributions!

Take 2 minutes to explore the {epiparameter} library.

Choose a disease of interest (e.g., Influenza, Measles, etc.) and a delay distribution (e.g., the incubation period, onset to death, etc.).

Find:

  • How many delay distributions are for that disease?

  • How many types of probability distribution (e.g., gamma, lognormal) are for a given delay in that disease?

Ask:

  • Do you recognise the papers?

  • Should {epiparameter} literature review consider any other paper?

The epidist_db() function with disease alone counts the number of entries like:

  • studies, and
  • delay distributions.

The epidist_db() function with disease and epi_dist gets a list of all entries with:

  • the complete citation,
  • the type of a probability distribution, and
  • distribution parameter values.

The combo of epidist_db() plus list_distributions() gets a data frame of all entries with columns like:

  • the type of the probability distribution per delay, and
  • author and year of the study.

We choose to explore Ebola’s delay distributions:

R

# we expect 16 delay distributions for ebola
epiparameter::epidist_db(
  disease = "ebola"
)

OUTPUT

Returning 17 results that match the criteria (17 are parameterised). 
Use subset to filter by entry variables or single_epidist to return a single entry. 
To retrieve the short citation for each use the 'get_citation' function

OUTPUT

List of <epidist> objects
  Number of entries in library: 17
  Number of studies in library: 4
  Number of diseases: 1
  Number of delay distributions: 16
  Number of offspring distributions: 1

Now, from the output of epiparameter::epidist_db(), what is an offspring distribution?

We choose to find Ebola’s incubation periods. This output lists all the papers and parameters found. Run this locally if needed:

R

epiparameter::epidist_db(
  disease = "ebola",
  epi_dist = "incubation"
)

We use list_distributions() to get a summary display of all:

R

# we expect 2 different types of delay distributions
# for ebola incubation period
epiparameter::epidist_db(
  disease = "ebola",
  epi_dist = "incubation"
) %>%
  list_distributions()

OUTPUT

Returning 5 results that match the criteria (5 are parameterised). 
Use subset to filter by entry variables or single_epidist to return a single entry. 
To retrieve the short citation for each use the 'get_citation' function

OUTPUT

              disease  epi_distribution prob_distribution       author year
1 Ebola Virus Disease incubation period             lnorm Martin E.... 2011
2 Ebola Virus Disease incubation period             gamma WHO Ebol.... 2015
3 Ebola Virus Disease incubation period             gamma WHO Ebol.... 2015
4 Ebola Virus Disease incubation period             gamma WHO Ebol.... 2015
5 Ebola Virus Disease incubation period             gamma WHO Ebol.... 2015

We find two types of probability distributions for this query: lognormal and gamma.

How does {epiparameter} do the collection and review of peer-reviewed literature? We invite you to read the vignette on “Data Collation and Synthesis Protocol”!

Select a single distribution


The epiparameter::epidist_db() function works as a filtering or subset function. Let’s use the author argument to filter Hiroshi Nishiura parameters:

R

epiparameter::epidist_db(
  disease = "covid",
  epi_dist = "serial",
  author = "Hiroshi"
) %>%
  epiparameter::list_distributions()

OUTPUT

Returning 2 results that match the criteria (2 are parameterised). 
Use subset to filter by entry variables or single_epidist to return a single entry. 
To retrieve the short citation for each use the 'get_citation' function

OUTPUT

   disease epi_distribution prob_distribution       author year
1 COVID-19  serial interval             lnorm Hiroshi .... 2020
2 COVID-19  serial interval           weibull Hiroshi .... 2020

We still get more than one epidemiological parameter. We can set the single_epidist argument to TRUE to return only one:

R

epiparameter::epidist_db(
  disease = "covid",
  epi_dist = "serial",
  author = "Hiroshi",
  single_epidist = TRUE
)

OUTPUT

Using Nishiura H, Linton N, Akhmetzhanov A (2020). "Serial interval of novel
coronavirus (COVID-19) infections." _International Journal of
Infectious Diseases_. doi:10.1016/j.ijid.2020.02.060
<https://doi.org/10.1016/j.ijid.2020.02.060>.. 
To retrieve the short citation use the 'get_citation' function

OUTPUT

Disease: COVID-19
Pathogen: SARS-CoV-2
Epi Distribution: serial interval
Study: Nishiura H, Linton N, Akhmetzhanov A (2020). "Serial interval of novel
coronavirus (COVID-19) infections." _International Journal of
Infectious Diseases_. doi:10.1016/j.ijid.2020.02.060
<https://doi.org/10.1016/j.ijid.2020.02.060>.
Distribution: lnorm
Parameters:
  meanlog: 1.386
  sdlog: 0.568

How does ‘single_epidist’ work?

Looking at the help documentation for ?epiparameter::epidist_db():

  • If multiple entries match the arguments supplied and single_epidist = TRUE, then the parameterised <epidist> with the largest sample size will be returned.
  • If multiple entries are equal after this sorting, the first entry will be returned.

What is a parametrised <epidist>? Look at ?is_parameterised.

Let’s assign this <epidist> class object to the covid_serialint object.

R

covid_serialint <-
  epiparameter::epidist_db(
    disease = "covid",
    epi_dist = "serial",
    author = "Nishiura",
    single_epidist = TRUE
  )

You can use plot() on <epidist> objects to visualise:

  • the Probability Density Function (PDF) and
  • the Cumulative Distribution Function (CDF).

R

# plot <epidist> object
plot(covid_serialint)

With the day_range argument, you can change the length or number of days on the x axis. Explore what this looks like:

R

# plot <epidist> object
plot(covid_serialint, day_range = 0:20)

Extract the summary statistics


We can get the mean and standard deviation (sd) from this <epidist> by diving into the summary_stats object:

R

# get the mean
covid_serialint$summary_stats$mean

OUTPUT

[1] 4.7

Now, we have an epidemiological parameter we can reuse! We can replace the summary statistics numbers we plug into the EpiNow2::dist_spec() function:

R

generation_time <- 
  EpiNow2::dist_spec(
    mean = covid_serialint$summary_stats$mean, # replaced!
    sd = covid_serialint$summary_stats$sd, # replaced!
    max = 20,
    distribution = "gamma"
  )

In the next episode we’ll learn how to use EpiNow2 to correctly specify distributions and estimate transmissibility, and how to use distribution functions to get a maximum value (max) for EpiNow2::dist_spec() when using {epiparameter} in your analysis.

Log normal distributions

If you need the log normal distribution parameters instead of the summary statistics, we can use epiparameter::get_parameters():

R

covid_serialint_parameters <-
  epiparameter::get_parameters(covid_serialint)

covid_serialint_parameters

OUTPUT

  meanlog     sdlog 
1.3862617 0.5679803 

This gets a vector of class <numeric> ready to use as input for any other package!
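As a sketch of what “ready to use” can mean, we could take a high quantile of this log normal with base R’s qlnorm() as one plausible candidate for a maximum value (this assumes the covid_serialint_parameters vector created above; it is not the only way to choose a max):

```r
# a sketch: the 99th percentile of the fitted log normal,
# one plausible candidate for a maximum serial interval value
qlnorm(
  p = 0.99,
  meanlog = covid_serialint_parameters["meanlog"],
  sdlog = covid_serialint_parameters["sdlog"]
)
```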

Challenges


Ebola’s serial interval

Take 1 minute to:

Get access to the Ebola serial interval with the highest sample size.

Answer:

  • What is the sd of the epidemiological distribution?

  • What is the sample_size used in that study?

Use the $ operator plus the Tab key to explore the nested elements as an expandable list:

R

covid_serialint$

Use str() to display the structure of the <epidist> R object.

R

# ebola serial interval
ebola_serial <-
  epiparameter::epidist_db(
    disease = "ebola",
    epi_dist = "serial",
    single_epidist = TRUE
  )

OUTPUT

Using WHO Ebola Response Team, Agua-Agum J, Ariyarajah A, Aylward B, Blake I,
Brennan R, Cori A, Donnelly C, Dorigatti I, Dye C, Eckmanns T, Ferguson
N, Formenty P, Fraser C, Garcia E, Garske T, Hinsley W, Holmes D,
Hugonnet S, Iyengar S, Jombart T, Krishnan R, Meijers S, Mills H,
Mohamed Y, Nedjati-Gilani G, Newton E, Nouvellet P, Pelletier L,
Perkins D, Riley S, Sagrado M, Schnitzler J, Schumacher D, Shah A, Van
Kerkhove M, Varsaneux O, Kannangarage N (2015). "West African Ebola
Epidemic after One Year — Slowing but Not Yet under Control." _The New
England Journal of Medicine_. doi:10.1056/NEJMc1414992
<https://doi.org/10.1056/NEJMc1414992>.. 
To retrieve the short citation use the 'get_citation' function

R

ebola_serial

OUTPUT

Disease: Ebola Virus Disease
Pathogen: Ebola Virus
Epi Distribution: serial interval
Study: WHO Ebola Response Team, Agua-Agum J, Ariyarajah A, Aylward B, Blake I,
Brennan R, Cori A, Donnelly C, Dorigatti I, Dye C, Eckmanns T, Ferguson
N, Formenty P, Fraser C, Garcia E, Garske T, Hinsley W, Holmes D,
Hugonnet S, Iyengar S, Jombart T, Krishnan R, Meijers S, Mills H,
Mohamed Y, Nedjati-Gilani G, Newton E, Nouvellet P, Pelletier L,
Perkins D, Riley S, Sagrado M, Schnitzler J, Schumacher D, Shah A, Van
Kerkhove M, Varsaneux O, Kannangarage N (2015). "West African Ebola
Epidemic after One Year — Slowing but Not Yet under Control." _The New
England Journal of Medicine_. doi:10.1056/NEJMc1414992
<https://doi.org/10.1056/NEJMc1414992>.
Distribution: gamma
Parameters:
  shape: 2.188
  scale: 6.490

R

# get the sd
ebola_serial$summary_stats$sd

OUTPUT

[1] 9.6

R

# get the sample_size
ebola_serial$metadata$sample_size

OUTPUT

[1] 305

Try to visualise this distribution using plot().

Also, explore all the other nested elements within the <epidist> object.

Share about:

  • What elements do you find useful for your analysis?
  • What other elements would you like to see in this object? How?

Ebola’s severity parameter

A severity parameter like the duration of hospitalisation could add to the information needed about bed capacity in response to an outbreak (Cori et al., 2017).

For Ebola:

  • What is the reported point estimate of the mean duration of health care and case isolation?

An informative delay should measure the time from symptom onset to recovery or death.

Find a way to access the whole {epiparameter} database and find out how that delay may be stored. The list_distributions() output is a data frame.

R

# one way to get the list of all the available parameters
epidist_db(disease = "all") %>%
  list_distributions() %>%
  as_tibble() %>%
  distinct(epi_distribution)

OUTPUT

Returning 122 results that match the criteria (99 are parameterised). 
Use subset to filter by entry variables or single_epidist to return a single entry. 
To retrieve the short citation for each use the 'get_citation' function

OUTPUT

# A tibble: 12 × 1
   epi_distribution            
   <chr>                       
 1 incubation period           
 2 serial interval             
 3 generation time             
 4 onset to death              
 5 offspring distribution      
 6 hospitalisation to death    
 7 hospitalisation to discharge
 8 notification to death       
 9 notification to discharge   
10 onset to discharge          
11 onset to hospitalisation    
12 onset to ventilation        

R

ebola_severity <- epidist_db(
  disease = "ebola",
  epi_dist = "onset to discharge"
)

OUTPUT

Returning 1 results that match the criteria (1 are parameterised). 
Use subset to filter by entry variables or single_epidist to return a single entry. 
To retrieve the short citation for each use the 'get_citation' function

R

# point estimate
ebola_severity$summary_stats$mean

OUTPUT

[1] 15.1

Check that for some {epiparameter} entries you will also have the uncertainty around the point estimate of each summary statistic:

R

# 95% confidence intervals
ebola_severity$summary_stats$mean_ci

OUTPUT

[1] 95

R

# limits of the confidence intervals
ebola_severity$summary_stats$mean_ci_limits

OUTPUT

[1] 14.6 15.6

The distribution Zoo

Explore this shinyapp called The Distribution Zoo!

Follow these steps to reproduce the form of the COVID serial interval distribution from {epiparameter} (covid_serialint object):

  1. Access the https://ben18785.shinyapps.io/distribution-zoo/ shiny app website,
  2. Go to the left panel,
  3. Keep the Category of distribution: Continuous Univariate,
  4. Select a new Type of distribution: Log-Normal,
  5. Move the sliders (i.e., the graphical controls that let you adjust a value by dragging a handle along a horizontal bar) to match the covid_serialint parameters.

Replicate these with the distribution object and all its list elements: [[2]], [[3]], and [[4]]. Explore how the shape of a distribution changes when its parameters change.

Share about:

  • What other features of the website do you find helpful?

Key Points

  • Use {epiparameter} to access the literature catalogue of epidemiological delay distributions.
  • Use epidist_db() to select single delay distributions.
  • Use list_distributions() for an overview of multiple delay distributions.
  • Reuse known estimates in the early stage of an outbreak of a new disease, when no contact tracing data are available.

Content from Quantifying transmission


Last updated on 2024-05-04 | Edit this page

Overview

Questions

  • How can I estimate the time-varying reproduction number (\(R_t\)) and growth rate from a time series of case data?
  • How can I quantify geographical heterogeneity from these transmission metrics?

Objectives

  • Learn how to estimate transmission metrics from a time series of case data using the R package EpiNow2

Prerequisites

Learners should familiarise themselves with the following concepts before working through this tutorial:

Statistics: probability distributions, principle of Bayesian analysis.

Epidemic theory: Effective reproduction number.

Reminder: the Effective Reproduction Number, \(R_t\)

The basic reproduction number, \(R_0\), is the average number of cases caused by one infectious individual in an entirely susceptible population.

But in an ongoing outbreak, the population does not remain entirely susceptible as those that recover from infection are typically immune. Moreover, there can be changes in behaviour or other factors that affect transmission. When we are interested in monitoring changes in transmission we are therefore more interested in the value of the effective reproduction number, \(R_t\), the average number of cases caused by one infectious individual in the population at time \(t\).
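As a toy illustration of this reminder (a simplification assuming homogeneous mixing and immunity as the only change, not how EpiNow2 estimates \(R_t\)), the effective reproduction number scales \(R_0\) by the fraction of the population still susceptible:

```r
# illustrative only: R_t = R0 * S_t / N under homogeneous mixing
R0 <- 2.5    # assumed basic reproduction number
N <- 1e6     # population size
S_t <- 6e5   # number still susceptible at time t
R0 * S_t / N
#> [1] 1.5
```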

Introduction


The transmission intensity of an outbreak is quantified using two key metrics: the reproduction number, which informs on the strength of the transmission by indicating how many new cases are expected from each existing case; and the growth rate, which informs on the speed of the transmission by indicating how rapidly the outbreak is spreading or declining (doubling/halving time) within a population. For more details on the distinction between speed and strength of transmission and implications for control, review Dushoff & Park, 2021.
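The “speed” interpretation can be made concrete: under exponential growth at rate \(r\), the doubling time is \(\ln(2)/r\). A quick sketch with an assumed growth rate:

```r
# assumed growth rate of 10% per day (illustrative)
r <- 0.1
log(2) / r  # doubling time in days
#> [1] 6.931472
```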

To estimate these key metrics using case data we must account for delays between the date of infections and date of reported cases. In an outbreak situation, data are usually available on reported dates only, therefore we must use estimation methods to account for these delays when trying to understand changes in transmission over time.

In the next tutorials we will focus on how to use the functions in EpiNow2 to estimate transmission metrics from case data. We will not cover the theoretical background of the models or the inference framework; for details on these concepts, see the vignette.

In this tutorial we are going to learn how to use the EpiNow2 package to estimate the time-varying reproduction number. We’ll use the dplyr package to arrange some of its inputs, ggplot2 to visualize case distribution, and the pipe %>% to connect some of their functions, so let’s also load the tidyverse package:

R

library(EpiNow2)
library(tidyverse)

The double-colon

The double-colon :: in R is used to access functions or objects from a specific package without loading the entire package into the current environment. This allows for a more targeted approach to using package components and helps avoid namespace conflicts.

:: lets you call a specific function from a package by explicitly mentioning the package name. For example, dplyr::filter(data, condition) uses filter() from the dplyr package without loading the entire package.

Bayesian inference

The R package EpiNow2 uses a Bayesian inference framework to estimate reproduction numbers and infection times based on reporting dates.

In Bayesian inference, we use prior knowledge (prior distributions) with data (in a likelihood function) to find the posterior probability.

Posterior probability \(\propto\) likelihood \(\times\) prior probability
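As a toy example of this relation (unrelated to EpiNow2’s internal model), a Beta prior combined with Binomial data yields a Beta posterior through simple parameter updates:

```r
# toy conjugate update: Beta(2, 2) prior on a probability,
# then observe 7 successes and 3 failures (illustrative data)
prior_a <- 2
prior_b <- 2
post_a <- prior_a + 7  # add observed successes
post_b <- prior_b + 3  # add observed failures

# posterior mean shifts from the prior mean (0.5) towards the data
post_a / (post_a + post_b)
#> [1] 0.6428571
```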

Delay distributions and case data


Case data

To illustrate the functions of EpiNow2 we will use outbreak data of the start of the COVID-19 pandemic from the United Kingdom. The data are available in the R package incidence2.

R

dplyr::as_tibble(incidence2::covidregionaldataUK)

OUTPUT

# A tibble: 6,370 × 13
   date       region   region_code cases_new cases_total deaths_new deaths_total
   <date>     <chr>    <chr>           <dbl>       <dbl>      <dbl>        <dbl>
 1 2020-01-30 East Mi… E12000004          NA          NA         NA           NA
 2 2020-01-30 East of… E12000006          NA          NA         NA           NA
 3 2020-01-30 England  E92000001           2           2         NA           NA
 4 2020-01-30 London   E12000007          NA          NA         NA           NA
 5 2020-01-30 North E… E12000001          NA          NA         NA           NA
 6 2020-01-30 North W… E12000002          NA          NA         NA           NA
 7 2020-01-30 Norther… N92000002          NA          NA         NA           NA
 8 2020-01-30 Scotland S92000003          NA          NA         NA           NA
 9 2020-01-30 South E… E12000008          NA          NA         NA           NA
10 2020-01-30 South W… E12000009          NA          NA         NA           NA
# ℹ 6,360 more rows
# ℹ 6 more variables: recovered_new <dbl>, recovered_total <dbl>,
#   hosp_new <dbl>, hosp_total <dbl>, tested_new <dbl>, tested_total <dbl>

To use the data, we must format the data to have two columns:

  • date: the date (as a date object; see ?is.Date()),
  • confirm: number of confirmed cases on that date.

Let’s use dplyr for this:

R

library(dplyr)

cases <- incidence2::covidregionaldataUK %>%
  select(date, cases_new) %>%
  group_by(date) %>%
  summarise(confirm = sum(cases_new, na.rm = TRUE)) %>%
  ungroup()

We can also use the incidence2 package to:

  • Aggregate cases (similar to the code above) but in different time intervals (i.e., days, weeks or months) or per group categories. Explore later the incidence2::incidence() reference manual.

  • Complete dates for all the range of dates per group category using incidence2::complete_dates(). Read further in its function reference manual.

R

library(tidyr)
library(dplyr)

incidence2::covidregionaldataUK %>%
  # preprocess missing values
  tidyr::replace_na(list(cases_new = 0)) %>%
  # compute the daily incidence
  incidence2::incidence(
    date_index = "date",
    counts = "cases_new",
    count_values_to = "confirm",
    date_names_to = "date"
  ) %>%
  # complete range of dates
  incidence2::complete_dates()

There are case data available for 490 days, but in an outbreak situation it is likely we would only have access to the beginning of this data set. Therefore we assume we only have the first 90 days of this data.

Delay distributions

We assume there are delays from the time of infection until the time a case is reported. We specify these delays as distributions to account for the uncertainty in individual level differences. The delay can consist of multiple types of delays/processes. A typical delay from time of infection to case reporting may consist of:

time from infection to symptom onset (the incubation period) + time from symptom onset to case notification (the reporting time).

The delay distribution for each of these processes can either be estimated from data or obtained from the literature. We can express uncertainty about the correct parameters of these distributions by specifying them as either fixed or variable. To understand the difference between fixed and variable distributions, let’s consider the incubation period.

Delays and data

The number of delays and type of delay are a flexible input that depend on the data. The examples below highlight how the delays can be specified for different data sources:

Data source Delay(s)
Time of symptom onset Incubation period
Time of case report Incubation period + time from symptom onset to case notification
Time of hospitalisation Incubation period + time from symptom onset to hospitalisation

Incubation period distribution

The distribution of incubation period for many diseases can usually be obtained from the literature. The package {epiparameter} contains a library of epidemiological parameters for different diseases obtained from the literature.

We specify a (fixed) gamma distribution with mean \(\mu = 4\) and standard deviation \(\sigma= 2\) (shape = \(4\), scale = \(1\)) using the function dist_spec() as follows:

R

incubation_period_fixed <- dist_spec(
  mean = 4, sd = 2,
  max = 20, distribution = "gamma"
)
incubation_period_fixed

OUTPUT


  Fixed distribution with PMF [0.019 0.12 0.21 0.21 0.17 0.11 0.069 0.039 0.021 0.011 0.0054 0.0026 0.0012 0.00058 0.00026 0.00012 5.3e-05 2.3e-05 1e-05 4.3e-06]

The argument max is the maximum value the distribution can take, in this example 20 days.

Why a gamma distribution?

The incubation period has to be positive in value. Therefore we must specify a distribution in dist_spec() that takes positive values only.

dist_spec() supports log normal and gamma distributions, which are distributions for positive values only.

For all types of delay, we will need to use distributions for positive values only - we don’t want to include delays of negative days in our analysis!

Including distribution uncertainty

To specify a variable distribution, we include uncertainty around the mean \(\mu\) and standard deviation \(\sigma\) of our gamma distribution. If our incubation period distribution has a mean \(\mu\) and standard deviation \(\sigma\), then we assume the mean (\(\mu\)) follows a Normal distribution with standard deviation \(\sigma_{\mu}\):

\[\mbox{Normal}(\mu,\sigma_{\mu}^2)\]

and the standard deviation (\(\sigma\)) follows a Normal distribution with standard deviation \(\sigma_{\sigma}\):

\[\mbox{Normal}(\sigma,\sigma_{\sigma}^2).\]

We specify this using dist_spec with the additional arguments mean_sd (\(\sigma_{\mu}\)) and sd_sd (\(\sigma_{\sigma}\)).

R

incubation_period_variable <- dist_spec(
  mean = 4, sd = 2,
  mean_sd = 0.5, sd_sd = 0.5,
  max = 20, distribution = "gamma"
)
incubation_period_variable

OUTPUT


  Uncertain gamma distribution with (untruncated) mean 4 (SD 0.5) and SD 2 (SD 0.5)

Reporting delays

After the incubation period, there will be an additional delay of time from symptom onset to case notification: the reporting delay. We can specify this as a fixed or variable distribution, or estimate a distribution from data.

When specifying a distribution, it is useful to visualise the probability density to see the peak and spread of the distribution. In this case we will use a log normal distribution. We can use the functions convert_to_logmean() and convert_to_logsd() to convert a mean and standard deviation on the natural scale to the corresponding meanlog and sdlog parameters of a log normal distribution.

If we want to assume that the mean reporting delay is 2 days (with a standard deviation of 1 day), the log normal distribution will look like:

R

log_mean <- convert_to_logmean(2, 1)
log_sd <- convert_to_logsd(2, 1)
x <- seq(from = 0, to = 10, length = 1000)
df <- data.frame(x = x, density = dlnorm(x, meanlog = log_mean, sdlog = log_sd))
ggplot(df) +
  geom_line(
    aes(x, density)
  ) +
  theme_grey(
    base_size = 15
  )

Using the mean and standard deviation for the log normal distribution, we can specify a fixed or variable distribution using dist_spec() as before:

R

reporting_delay_variable <- dist_spec(
  mean = log_mean, sd = log_sd,
  mean_sd = 0.5, sd_sd = 0.5,
  max = 10, distribution = "lognormal"
)

If data is available on the time between symptom onset and reporting, we can use the function estimate_delay() to estimate a log normal distribution from a vector of delays. The code below illustrates how to use estimate_delay() with synthetic delay data.

R

delay_data <- rlnorm(500, log(5), 1) # synthetic delay data
reporting_delay <- estimate_delay(
  delay_data,
  samples = 1000,
  bootstraps = 10
)

Generation time

We also must specify a distribution for the generation time. Here we will use a log normal distribution with mean 3.6 and standard deviation 3.1 (Ganyani et al. 2020).

R

generation_time_variable <- dist_spec(
  mean = 3.6, sd = 3.1,
  mean_sd = 0.5, sd_sd = 0.5,
  max = 20, distribution = "lognormal"
)

Finding estimates


The function epinow() is a wrapper for the function estimate_infections() used to estimate cases by date of infection. The generation time distribution and delay distributions must be passed using the functions generation_time_opts() and delay_opts() respectively.

There are numerous other inputs that can be passed to epinow(); see ?EpiNow2::epinow for more detail. One optional input is to specify a log normal prior for the effective reproduction number \(R_t\) at the start of the outbreak. We specify its mean and standard deviation as arguments of prior within rt_opts():

R

rt_log_mean <- convert_to_logmean(2, 1)
rt_log_sd <- convert_to_logsd(2, 1)
rt <- rt_opts(prior = list(mean = rt_log_mean, sd = rt_log_sd))

Bayesian inference using Stan

The Bayesian inference is performed using MCMC methods with the program Stan. There are a number of default inputs to the Stan functions including the number of chains and number of samples per chain (see ?EpiNow2::stan_opts()).

To reduce computation time, we can run chains in parallel. To do this, we must set the number of cores to be used. By default, 4 MCMC chains are run (see stan_opts()$chains), so we can set an equal number of cores to be used in parallel as follows:

R

withr::local_options(list(mc.cores = 4))

To find the maximum number of available cores on your machine, use parallel::detectCores().
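A defensive pattern worth considering (a suggestion, not an EpiNow2 requirement) is to cap the number of cores at the default number of chains while leaving one core free for the rest of the system:

```r
# use at most 4 cores (matching the default number of chains),
# keeping one core free for the operating system
n_cores <- max(1, min(4, parallel::detectCores() - 1))
withr::local_options(list(mc.cores = n_cores))
```

Using more cores than chains gives no extra speed-up for the MCMC step, so capping at the chain count is a reasonable default.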

Checklist

Note: In the code below _fixed distributions are used instead of _variable (delay distributions with uncertainty). This is to speed up computation time. It is generally recommended to use variable distributions that account for additional uncertainty.

R

# fixed alternatives
generation_time_fixed <- dist_spec(
  mean = 3.6, sd = 3.1,
  max = 20, distribution = "lognormal"
)

reporting_delay_fixed <- dist_spec(
  mean = log_mean, sd = log_sd,
  max = 10, distribution = "lognormal"
)

Now you are ready to run EpiNow2::epinow() to estimate the time-varying reproduction number:

R

reported_cases <- cases[1:90, ]

estimates <- epinow(
  # cases
  reported_cases = reported_cases,
  # delays
  generation_time = generation_time_opts(generation_time_fixed),
  delays = delay_opts(incubation_period_fixed + reporting_delay_fixed),
  # prior
  rt = rt_opts(prior = list(mean = rt_log_mean, sd = rt_log_sd)),
  # computation (optional)
  stan = stan_opts(samples = 1000, chains = 3)
)

OUTPUT

WARN [2024-05-04 08:57:44] epinow: There were 4 divergent transitions after warmup. See
https://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
to find out why this is a problem and how to eliminate them. - 
WARN [2024-05-04 08:57:44] epinow: Examine the pairs() plot to diagnose sampling problems
 - 
WARN [2024-05-04 08:57:45] epinow: Bulk Effective Samples Size (ESS) is too low, indicating posterior means and medians may be unreliable.
Running the chains for more iterations may help. See
https://mc-stan.org/misc/warnings.html#bulk-ess - 

Do not wait for this to continue

Using stan = stan_opts() is optional. For the purposes of this tutorial, to reduce computation time, we specified a fixed number of samples = 1000 and chains = 3 in the stan argument using the stan_opts() function. We expect this to take approximately 3 minutes.

Remember: Using an appropriate number of samples and chains is crucial for ensuring convergence and obtaining reliable estimates in Bayesian computations using Stan. Inadequate sampling or insufficient chains may lead to issues such as divergent transitions, impacting the accuracy and stability of the inference process.

Results

We can extract and visualise estimates of the effective reproduction number through time:

R

estimates$plots$R

The uncertainty in the estimates increases through time. This is because estimates are informed by data in the past, within the delay periods. This difference in uncertainty is categorised into the Estimate (green), which uses all data, and the Estimate based on partial data (orange), which is based on less data (because infections that happened at the time are more likely not to have been observed yet) and therefore has increasingly wide intervals towards the date of the last data point. Finally, the Forecast (purple) is a projection ahead of time.

We can also visualise the growth rate estimate through time:

R

estimates$plots$growth_rate

To extract a summary of the key transmission metrics at the latest date in the data:

R

summary(estimates)

OUTPUT

                                 measure                 estimate
                                  <char>                   <char>
1: New confirmed cases by infection date     7156 (4069 -- 12721)
2:        Expected change in daily cases        Likely decreasing
3:            Effective reproduction no.       0.89 (0.58 -- 1.3)
4:                        Rate of growth -0.015 (-0.062 -- 0.037)
5:          Doubling/halving time (days)          -45 (19 -- -11)

As these estimates are based on partial data, they have a wide uncertainty interval.

  • From the summary of our analysis we see that the expected change in daily cases is Likely decreasing with the estimated new confirmed cases 7156 (4069 – 12721).

  • The effective reproduction number \(R_t\) estimate (on the last date of the data) is 0.89 (0.58 – 1.3).

  • The exponential growth rate of case numbers is -0.015 (-0.062 – 0.037).

  • The doubling time (the time taken for case numbers to double) is -45 (19 – -11) days; a negative doubling time indicates that case numbers are halving.
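As a quick sanity check, the doubling/halving time \(t\) relates to the exponential growth rate \(r\) via \(t = \ln(2) / r\), so we can recompute it from the reported central growth-rate estimate (small differences from the reported -45 come from rounding of \(r\)):

```r
# doubling (or halving) time implied by the growth rate: t = ln(2) / r
growth_rate <- -0.015
log(2) / growth_rate  # about -46 days; negative, i.e. a halving time
```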

Expected change in daily cases

A factor describing expected change in daily cases based on the posterior probability that \(R_t < 1\).

Probability (\(p\)) Expected change
\(p < 0.05\) Increasing
\(0.05 \leq p< 0.4\) Likely increasing
\(0.4 \leq p< 0.6\) Stable
\(0.6 \leq p < 0.95\) Likely decreasing
\(0.95 \leq p \leq 1\) Decreasing
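The mapping in this table can be sketched with a hypothetical helper (not an EpiNow2 function) using base R’s cut():

```r
# hypothetical helper: classify the posterior probability that Rt < 1
# using the breakpoints from the table above
expected_change <- function(p) {
  cut(
    p,
    breaks = c(0, 0.05, 0.4, 0.6, 0.95, 1),
    labels = c(
      "Increasing", "Likely increasing", "Stable",
      "Likely decreasing", "Decreasing"
    ),
    right = FALSE,          # intervals closed on the left, as in the table
    include.lowest = TRUE   # make the final interval include p = 1
  )
}

expected_change(c(0.03, 0.50, 0.97))
# Increasing, Stable, Decreasing
```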

Quantify geographical heterogeneity


The outbreak data of the start of the COVID-19 pandemic from the United Kingdom from the R package incidence2 includes the region in which the cases were recorded. To find regional estimates of the effective reproduction number and cases, we must format the data to have three columns:

  • date: the date,
  • region: the region,
  • confirm: number of confirmed cases for a region on a given date.

R

regional_cases <-
  incidence2::covidregionaldataUK[, c("date", "cases_new", "region")]
colnames(regional_cases) <- c("date", "confirm", "region")

# extract the first 90 dates for all regions
dates <- sort(unique(regional_cases$date))[1:90]
regional_cases <- regional_cases[which(regional_cases$date %in% dates), ]

head(regional_cases)

OUTPUT

        date confirm          region
1 2020-01-30      NA   East Midlands
2 2020-01-30      NA East of England
3 2020-01-30       2         England
4 2020-01-30      NA          London
5 2020-01-30      NA      North East
6 2020-01-30      NA      North West

To find regional estimates, we pass the same inputs as for epinow() to the function regional_epinow():

R

estimates_regional <- regional_epinow(
  # cases
  reported_cases = regional_cases,
  # delays
  generation_time = generation_time_opts(generation_time_fixed),
  delays = delay_opts(incubation_period_fixed + reporting_delay_fixed),
  # prior
  rt = rt_opts(prior = list(mean = rt_log_mean, sd = rt_log_sd))
)

OUTPUT

INFO [2024-05-04 08:57:48] Producing following optional outputs: regions, summary, samples, plots, latest
INFO [2024-05-04 08:57:48] Reporting estimates using data up to: 2020-04-28
INFO [2024-05-04 08:57:48] No target directory specified so returning output
INFO [2024-05-04 08:57:48] Producing estimates for: East Midlands, East of England, England, London, North East, North West, Northern Ireland, Scotland, South East, South West, Wales, West Midlands, Yorkshire and The Humber
INFO [2024-05-04 08:57:48] Regions excluded: none
INFO [2024-05-04 09:42:33] Completed regional estimates
INFO [2024-05-04 09:42:33] Regions with estimates: 13
INFO [2024-05-04 09:42:33] Regions with runtime errors: 0
INFO [2024-05-04 09:42:33] Producing summary
INFO [2024-05-04 09:42:33] No summary directory specified so returning summary output
INFO [2024-05-04 09:42:34] No target directory specified so returning timings

R

estimates_regional$summary$summarised_results$table

OUTPUT

                      Region New confirmed cases by infection date
                      <char>                                <char>
 1:            East Midlands                      345 (212 -- 547)
 2:          East of England                      541 (336 -- 855)
 3:                  England                   3496 (2207 -- 5608)
 4:                   London                      298 (183 -- 460)
 5:               North East                      252 (149 -- 426)
 6:               North West                      559 (330 -- 879)
 7:         Northern Ireland                         44 (23 -- 83)
 8:                 Scotland                      289 (163 -- 517)
 9:               South East                      591 (350 -- 983)
10:               South West                      417 (295 -- 605)
11:                    Wales                        95 (65 -- 138)
12:            West Midlands                      271 (143 -- 485)
13: Yorkshire and The Humber                      484 (287 -- 777)
    Expected change in daily cases Effective reproduction no.
                            <fctr>                     <char>
 1:              Likely increasing          1.2 (0.85 -- 1.6)
 2:              Likely increasing          1.2 (0.83 -- 1.6)
 3:              Likely decreasing          0.9 (0.64 -- 1.2)
 4:              Likely decreasing         0.79 (0.55 -- 1.1)
 5:              Likely decreasing         0.91 (0.61 -- 1.3)
 6:              Likely decreasing         0.87 (0.58 -- 1.2)
 7:              Likely decreasing           0.65 (0.39 -- 1)
 8:              Likely decreasing         0.92 (0.61 -- 1.4)
 9:                         Stable         0.99 (0.67 -- 1.4)
10:                     Increasing           1.4 (1.1 -- 1.8)
11:                     Decreasing        0.57 (0.42 -- 0.76)
12:              Likely decreasing         0.71 (0.42 -- 1.1)
13:                         Stable             1 (0.7 -- 1.4)
               Rate of growth Doubling/halving time (days)
                       <char>                       <char>
 1:    0.024 (-0.02 -- 0.067)               29 (10 -- -34)
 2:   0.022 (-0.023 -- 0.066)               31 (10 -- -30)
 3:   -0.013 (-0.053 -- 0.03)              -53 (23 -- -13)
 4:  -0.029 (-0.069 -- 0.012)              -24 (57 -- -10)
 5:  -0.013 (-0.058 -- 0.036)              -55 (19 -- -12)
 6:  -0.017 (-0.062 -- 0.023)              -40 (30 -- -11)
 7:   -0.051 (-0.1 -- 0.0056)              -13 (120 -- -7)
 8:  -0.011 (-0.058 -- 0.043)              -63 (16 -- -12)
 9:  -0.0011 (-0.048 -- 0.05)             -610 (14 -- -15)
10:    0.047 (0.013 -- 0.085)               15 (8.1 -- 54)
11: -0.065 (-0.092 -- -0.033)            -11 (-21 -- -7.5)
12:  -0.041 (-0.093 -- 0.013)             -17 (52 -- -7.5)
13:  0.0037 (-0.043 -- 0.048)              190 (14 -- -16)

R

estimates_regional$summary$plots$R

The i2extras package

The i2extras package can also estimate transmission metrics like the growth rate and doubling/halving time at different time intervals (i.e., days, weeks, or months). i2extras requires incidence2 objects as inputs. Read further in its Fitting curves vignette.

Summary


EpiNow2 can be used to estimate transmission metrics from case data at any time in the course of an outbreak. The reliability of these estimates depends on the quality of the data and appropriate choice of delay distributions. In the next tutorial we will learn how to make forecasts and investigate some of the additional inference options available in EpiNow2.

Key Points

  • Transmission metrics can be estimated from case data after accounting for delays
  • Uncertainty can be accounted for in delay distributions

Content from Use delay distributions in analysis


Last updated on 2024-05-04 | Edit this page

Overview

Questions

  • How to reuse delays stored in the {epiparameter} library with my existing analysis pipeline?

Objectives

  • Use distribution functions with continuous and discrete distributions stored as <epidist> objects.
  • Convert a continuous to a discrete distribution with {epiparameter}.
  • Connect {epiparameter} outputs with EpiNow2 inputs.

Prerequisites

This episode requires you to be familiar with:

Data science : Basic programming with R.

Statistics : Probability distributions.

Epidemic theory : Epidemiological parameters, time periods, Effective reproductive number.

Introduction


{epiparameter} helps us to choose one specific set of epidemiological parameters from the literature, instead of copying and pasting them by hand:

R

covid_serialint <-
  epiparameter::epidist_db(
    disease = "covid",
    epi_dist = "serial",
    author = "Nishiura",
    single_epidist = TRUE
  )

Now we have an epidemiological parameter we can use in our analysis! In the chunk below we replaced the summary statistics inputs to EpiNow2::dist_spec():

R

generation_time <- 
  EpiNow2::dist_spec(
    mean = covid_serialint$summary_stats$mean, # replaced!
    sd = covid_serialint$summary_stats$sd, # replaced!
    max = 20,
    distribution = "gamma"
  )

In this episode, we will use the distribution functions that {epiparameter} provides to get a maximum value (max) for this and any other package downstream in your analysis pipeline!

Let’s load the {epiparameter} and EpiNow2 packages. For EpiNow2, we’ll set 4 cores to be used in parallel computations. We’ll use the pipe %>%, some dplyr verbs and ggplot2, so let’s also load the tidyverse package:

R

library(epiparameter)
library(EpiNow2)
library(tidyverse)

withr::local_options(list(mc.cores = 4))

The double-colon

The double-colon :: in R is used to access functions or objects from a specific package without loading the entire package into the current environment. This allows for a more targeted approach to using package components and helps avoid namespace conflicts.

:: lets you call a specific function from a package by explicitly mentioning the package name. For example, dplyr::filter(data, condition) uses filter() from the dplyr package without loading the entire package.

Distribution functions


In R, all the statistical distributions have functions to access the following:

  • density(): Probability Density function (PDF),
  • cdf(): Cumulative Distribution function (CDF),
  • quantile(): Quantile function, and
  • generate(): Random values from the given distribution.

Functions for the Normal distribution

If you need it, read in detail about the R probability functions for the normal distribution, each of its definitions and identify in which part of a distribution they are located!

The four probability functions for the normal distribution (Jack Weiss, 2012)

If you look at ?stats::Distributions, each type of distribution has a unique set of functions. However, {epiparameter} gives you the same four functions to access each of the values above for any <epidist> object you want!

R

# plot this to have a visual reference
plot(covid_serialint, day_range = 0:20)

R

# the density value at quantile value of 10 (days)
density(covid_serialint, at = 10)

OUTPUT

[1] 0.01911607

R

# the cumulative probability at quantile value of 10 (days)
cdf(covid_serialint, q = 10)

OUTPUT

[1] 0.9466605

R

# the quantile value (day) at a cumulative probability of 60%
quantile(covid_serialint, p = 0.6)

OUTPUT

[1] 4.618906

R

# generate 10 random values (days) given
# the distribution family and its parameters
generate(covid_serialint, times = 10)

OUTPUT

 [1]  3.698099 14.965865  2.538199  3.947991  6.919752  8.699023  2.928458
 [8] 19.117628  4.032281  7.400597

Window for contact tracing and the Serial interval

The serial interval is important in the optimisation of contact tracing since it provides a time window for the containment of a disease spread (Fine, 2003). Depending on the serial interval, we can evaluate the need to expand the number of days pre-onset to consider in the contact tracing to include more backwards contacts (Davis et al., 2020).

With the COVID-19 serial interval (covid_serialint) calculate:

  • How much more of the backward cases could be captured if the contact tracing method considered contacts up to 6 days pre-onset compared to 2 days pre-onset?

In Figure 5 from the R probability functions for the normal distribution, the shadowed section represents a cumulative probability of 0.997 for the quantile value at x = 2.

R

plot(covid_serialint)

R

cdf(covid_serialint, q = 2)

OUTPUT

[1] 0.1111729

R

cdf(covid_serialint, q = 6)

OUTPUT

[1] 0.7623645

Given the COVID-19 serial interval:

  • A contact tracing method considering contacts up to 2 days pre-onset will capture around 11.1% of backward cases.

  • If this period is extended to 6 days pre-onset, this could include 76.2% of backward contacts.

If we exchange the question between days and cumulative probability to:

  • When considering secondary cases, how many days following the symptom onset of primary cases can we expect 55% of symptom onset to occur?

R

quantile(covid_serialint, p = 0.55)

An interpretation could be:

  • 55% of the symptom onsets of secondary cases will occur within 4.2 days after the symptom onset of the primary cases.

Discretise a continuous distribution


We are getting closer to the end! EpiNow2::dist_spec() still needs a maximum value (max).

One way to do this is to get the quantile value for the distribution’s 99.9th percentile or 0.999 cumulative probability. For this, we need access to the set of distribution functions for our <epidist> object.

We could use the set of distribution functions for a continuous distribution (as above), but these values would be continuous numbers. Instead, we can discretise the continuous distribution stored in our <epidist> object to get discrete values.

When we epiparameter::discretise() the continuous distribution we get a discrete(-ized) distribution:

R

covid_serialint_discrete <-
  epiparameter::discretise(covid_serialint)

covid_serialint_discrete

OUTPUT

Disease: COVID-19
Pathogen: SARS-CoV-2
Epi Distribution: serial interval
Study: Nishiura H, Linton N, Akhmetzhanov A (2020). "Serial interval of novel
coronavirus (COVID-19) infections." _International Journal of
Infectious Diseases_. doi:10.1016/j.ijid.2020.02.060
<https://doi.org/10.1016/j.ijid.2020.02.060>.
Distribution: discrete lnorm
Parameters:
  meanlog: 1.386
  sdlog: 0.568

We identify this change in the Distribution: output line of the <epidist> object. Double check this line:

Distribution: discrete lnorm

While for a continuous distribution, we plot the Probability Density Function (PDF), for a discrete distribution, we plot the Probability Mass Function (PMF):

R

# continuous
plot(covid_serialint)

# discrete
plot(covid_serialint_discrete)

To finally get a max value, let’s access the quantile value of the 99.9th percentile or 0.999 probability of the distribution with the prob_dist$q notation, similarly to how we access the summary_stats values.

R

covid_serialint_discrete_max <-
  quantile(covid_serialint_discrete, p = 0.999)

Length of quarantine and Incubation period

The incubation period distribution is a useful delay to assess the length of active monitoring or quarantine (Lauer et al., 2020). Similarly, delays from symptom onset to recovery (or death) will determine the required duration of health care and case isolation (Cori et al., 2017).

Calculate:

  • Within what time frame do 99% of individuals who develop COVID-19 symptoms exhibit them after infection?

What delay distribution measures the time between infection and the onset of symptoms?

The probability functions for <epidist> discrete distributions are the same that we used for the continuous ones!

R

# plot to have a visual reference
plot(covid_serialint_discrete, day_range = 0:20)

# density value at quantile value 10 (day)
density(covid_serialint_discrete, at = 10)

# cumulative probability at quantile value 10 (day)
cdf(covid_serialint_discrete, q = 10)

# In what quantile value (days) do we have the 60% cumulative probability?
quantile(covid_serialint_discrete, p = 0.6)

# generate random values
generate(covid_serialint_discrete, times = 10)

R

covid_incubation <-
  epiparameter::epidist_db(
    disease = "covid",
    epi_dist = "incubation",
    single_epidist = TRUE
  )

OUTPUT

Using McAloon C, Collins Á, Hunt K, Barber A, Byrne A, Butler F, Casey M,
Griffin J, Lane E, McEvoy D, Wall P, Green M, O'Grady L, More S (2020).
"Incubation period of COVID-19: a rapid systematic review and
meta-analysis of observational research." _BMJ Open_.
doi:10.1136/bmjopen-2020-039652
<https://doi.org/10.1136/bmjopen-2020-039652>.. 
To retrieve the short citation use the 'get_citation' function

R

covid_incubation_discrete <- epiparameter::discretise(covid_incubation)

quantile(covid_incubation_discrete, p = 0.99)

OUTPUT

[1] 16

99% of those who develop COVID-19 symptoms will do so within 16 days of infection.

Now, is this result expected in epidemiological terms?

From a maximum value with quantile(), we can create a sequence of quantile values as a numeric vector and calculate density() values for each:

R

# create a discrete distribution visualisation
# from a maximum value from the distribution
quantile(covid_serialint_discrete, p = 0.999) %>%
  # generate quantile values
  # as a sequence for each natural number
  seq(1L, to = ., by = 1L) %>%
  # coerce numeric vector to data frame
  as_tibble_col(column_name = "quantile_values") %>%
  mutate(
    # calculate density values
    # for each quantile in the density function
    density_values =
      density(
        x = covid_serialint_discrete,
        at = quantile_values
      )
  ) %>%
  # create plot
  ggplot(
    aes(
      x = quantile_values,
      y = density_values
    )
  ) +
  geom_col()

Remember: In infections with pre-symptomatic transmission, serial intervals can have negative values (Nishiura et al., 2020). When we use the serial interval to approximate the generation time, we need to restrict this distribution to positive values only!

Plug-in {epiparameter} to {EpiNow2}


Now we can plug everything into the EpiNow2::dist_spec() function!

  • the summary statistics mean and sd of the distribution,
  • a maximum value max,
  • the distribution name.

However, when using EpiNow2::dist_spec() to define a log normal distribution, like the one in the covid_serialint object, we need to convert its summary statistics to the distribution parameters named meanlog and sdlog. With {epiparameter} we can directly get the distribution parameters using epiparameter::get_parameters():

R

covid_serialint_parameters <-
  epiparameter::get_parameters(covid_serialint)

Then, we have:

R

serial_interval_covid <-
  dist_spec(
    mean = covid_serialint_parameters["meanlog"],
    sd = covid_serialint_parameters["sdlog"],
    max = covid_serialint_discrete_max,
    distribution = "lognormal"
  )

serial_interval_covid

OUTPUT


  Fixed distribution with PMF [0.0073 0.1 0.2 0.19 0.15 0.11 0.075 0.051 0.035 0.023 0.016 0.011 0.0076 0.0053 0.0037 0.0027 0.0019 0.0014 0.001 0.00074 0.00055 0.00041 0.00031]

Assuming a COVID-19 scenario, let’s use the first 60 days of the example_confirmed data set from the EpiNow2 package as reported_cases and the recently created serial_interval_covid object as inputs to estimate the time-varying reproduction number using EpiNow2::epinow().

R

epinow_estimates_cg <- epinow(
  # cases
  reported_cases = example_confirmed[1:60],
  # delays
  generation_time = generation_time_opts(serial_interval_covid)
)

OUTPUT

WARN [2024-05-04 09:44:07] epinow: There were 1 divergent transitions after warmup. See
https://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
to find out why this is a problem and how to eliminate them. - 
WARN [2024-05-04 09:44:07] epinow: Examine the pairs() plot to diagnose sampling problems
 - 

R

base::plot(epinow_estimates_cg)

The plot() output includes the estimated cases by date of infection, which are reconstructed from the reported cases and delays.

Warning

Using the serial interval instead of the generation time is an alternative that can propagate bias into your estimates, even more so for diseases with reported pre-symptomatic transmission (Chung Lau et al., 2021).

Adjusting for reporting delays


Estimating \(R_t\) requires data on the daily number of new infections. Due to lags in the development of detectable viral loads, symptom onset, seeking care, and reporting, these numbers are not readily available. All observations reflect transmission events from some time in the past. In other words, if \(d\) is the delay from infection to observation, then observations at time \(t\) inform \(R_{t−d}\), not \(R_t\). (Gostic et al., 2020)

Timeline for chain of disease reporting, the Netherlands. Lab, laboratory; PHA, public health authority. From Marinović et al., 2015

The delay distribution could be inferred jointly with the underlying times of infection or estimated as the sum of the incubation period distribution and the distribution of delays from symptom onset to observation from line list data (reporting delay). For EpiNow2, we can specify these two complementary delay distributions in the delays argument.
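To make this concrete, here is a minimal base-R sketch (not EpiNow2's internal machinery) of how two delay distributions combine: the PMF of the sum of two independent discrete delays is the convolution of their PMFs. The PMF values below are hypothetical.

```r
# Hypothetical discretised daily PMFs (mass on days 0, 1, 2, ...)
incubation_pmf <- c(0.1, 0.3, 0.4, 0.2)
reporting_pmf <- c(0.5, 0.3, 0.2)

# The distribution of the sum of independent delays is the convolution
# of the two PMFs; base R computes it with convolve(type = "open")
total_delay_pmf <- convolve(incubation_pmf, rev(reporting_pmf), type = "open")

round(total_delay_pmf, 2)
sum(total_delay_pmf) # still a valid PMF: sums to 1
```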

\(R_{t}\) is a measure of transmission at time \(t\). Observations after time \(t\) must be adjusted. ICU, intensive care unit. From Gostic et al., 2020

Use an Incubation period for COVID-19 to estimate Rt

Estimate the time-varying reproduction number for the first 60 days of the example_confirmed data set from EpiNow2. Access an incubation period for COVID-19 from {epiparameter} to use it as a reporting delay.

Reuse the last epinow() calculation, adding the delays argument and the delay_opts() helper function.

The delays argument and the delay_opts() helper function are analogous to the generation_time argument and the generation_time_opts() helper function.

R

epinow_estimates <- epinow(
  # cases
  reported_cases = example_confirmed[1:60],
  # delays
  generation_time = generation_time_opts(covid_serial_interval),
  delays = delay_opts(covid_incubation_time)
)

R

# generation time ---------------------------------------------------------

# get covid serial interval
covid_serialint <-
  epiparameter::epidist_db(
    disease = "covid",
    epi_dist = "serial",
    author = "Nishiura",
    single_epidist = TRUE
  )

# adapt epidist to epinow2
covid_serialint_discrete_max <-
  covid_serialint %>%
  discretise() %>%
  quantile(p = 0.999)

covid_serialint_parameters <-
  epiparameter::get_parameters(covid_serialint)

covid_serial_interval <-
  dist_spec(
    mean = covid_serialint_parameters["meanlog"],
    sd = covid_serialint_parameters["sdlog"],
    max = covid_serialint_discrete_max,
    distribution = "lognormal"
  )

# incubation time ---------------------------------------------------------

# get covid incubation period
covid_incubation <- epiparameter::epidist_db(
  disease = "covid",
  epi_dist = "incubation",
  author = "Natalie",
  single_epidist = TRUE
)

# adapt epiparameter to epinow2
covid_incubation_discrete_max <-
  covid_incubation %>%
  discretise() %>%
  quantile(p = 0.999)

covid_incubation_parameters <-
  epiparameter::get_parameters(covid_incubation)

covid_incubation_time <-
  dist_spec(
    mean = covid_incubation_parameters["meanlog"],
    sd = covid_incubation_parameters["sdlog"],
    max = covid_incubation_discrete_max,
    distribution = "lognormal" # do not forget this!
  )

# epinow ------------------------------------------------------------------

# run epinow
epinow_estimates_cgi <- epinow(
  # cases
  reported_cases = example_confirmed[1:60],
  # delays
  generation_time = generation_time_opts(covid_serial_interval),
  delays = delay_opts(covid_incubation_time)
)

OUTPUT

WARN [2024-05-04 09:46:00] epinow: There were 7 divergent transitions after warmup. See
https://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
to find out why this is a problem and how to eliminate them. - 
WARN [2024-05-04 09:46:00] epinow: Examine the pairs() plot to diagnose sampling problems
 - 

R

base::plot(epinow_estimates_cgi)

Try to complement the delays argument with a reporting delay like the reporting_delay_fixed object of the previous episode.

How much has it changed?

After adding the incubation period, discuss:

  • Does the trend of the model fit in the “Estimate” section change?
  • Has the uncertainty changed?
  • How would you explain or interpret any of these changes?

Compare all the EpiNow2 figures generated previously.

Challenges


A code completion tip

If we type the object name followed by empty square brackets, covid_serialint_parameters[], we can press the Tab key inside the brackets to trigger the code completion feature.

This gives quick access to covid_serialint_parameters["meanlog"] and covid_serialint_parameters["sdlog"].

We invite you to try this out in code chunks and the R console!

Ebola’s effective reproduction number adjusted by reporting delays

Download and read the Ebola dataset:

  • Estimate the effective reproduction number using EpiNow2
  • Adjust the estimate by the available reporting delays in {epiparameter}
  • Why did you choose that parameter?

To calculate the \(R_t\) using EpiNow2, we need:

  • Aggregated incidence data, with confirmed cases per day, and
  • The generation time distribution.
  • Optionally, reporting delay distributions when available (e.g., incubation period).

To get delay distributions using {epiparameter} we can use functions like:

  • epidist_db()
  • list_distributions()
  • discretise()
  • quantile()

R

# read data
# e.g.: if path to file is data/raw-data/ebola_cases.csv then:
ebola_confirmed <-
  read_csv(here::here("data", "raw-data", "ebola_cases.csv"))

# list distributions
epidist_db(disease = "ebola") %>%
  list_distributions()

R

# generation time ---------------------------------------------------------

# subset one distribution for the generation time
ebola_serial <- epidist_db(
  disease = "ebola",
  epi_dist = "serial",
  single_epidist = TRUE
)

# adapt epiparameter to epinow2
ebola_serial_discrete <- discretise(ebola_serial)

serial_interval_ebola <-
  dist_spec(
    mean = ebola_serial$summary_stats$mean,
    sd = ebola_serial$summary_stats$sd,
    max = quantile(ebola_serial_discrete, p = 0.999),
    distribution = "gamma"
  )

# incubation time ---------------------------------------------------------

# subset one distribution for delay of the incubation period
ebola_incubation <- epidist_db(
  disease = "ebola",
  epi_dist = "incubation",
  single_epidist = TRUE
)

# adapt epiparameter to epinow2
ebola_incubation_discrete <- discretise(ebola_incubation)

incubation_period_ebola <-
  dist_spec(
    mean = ebola_incubation$summary_stats$mean,
    sd = ebola_incubation$summary_stats$sd,
    max = quantile(ebola_incubation_discrete, p = 0.999),
    distribution = "gamma"
  )

# epinow ------------------------------------------------------------------

# run epinow
epinow_estimates_egi <- epinow(
  # cases
  reported_cases = ebola_confirmed,
  # delays
  generation_time = generation_time_opts(serial_interval_ebola),
  delays = delay_opts(incubation_period_ebola)
)

OUTPUT

WARN [2024-05-04 09:49:21] epinow: There were 9 divergent transitions after warmup. See
https://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
to find out why this is a problem and how to eliminate them. - 
WARN [2024-05-04 09:49:21] epinow: Examine the pairs() plot to diagnose sampling problems
 - 

R

plot(epinow_estimates_egi)

What to do with Weibull distributions?

Use the influenza_england_1978_school dataset from the outbreaks package to calculate the effective reproduction number using EpiNow2 adjusting by the available reporting delays in {epiparameter}.

EpiNow2::dist_spec() also accepts Probability Mass Functions (PMF) from any distribution family. Read the reference guide on Specify a distribution.

R

# What parameters are available for Influenza?
epidist_db(disease = "influenza") %>%
  list_distributions() %>%
  as_tibble() %>%
  count(epi_distribution)

OUTPUT

# A tibble: 3 × 2
  epi_distribution      n
  <chr>             <int>
1 generation time       1
2 incubation period    15
3 serial interval       1

R

# generation time ---------------------------------------------------------

# Read the generation time
influenza_generation <-
  epidist_db(
    disease = "influenza",
    epi_dist = "generation"
  )

influenza_generation

OUTPUT

Disease: Influenza
Pathogen: Influenza-A-H1N1
Epi Distribution: generation time
Study: Lessler J, Reich N, Cummings D, New York City Department of Health and
Mental Hygiene Swine Influenza Investigation Team (2009). "Outbreak of
2009 Pandemic Influenza A (H1N1) at a New York City School." _The New
England Journal of Medicine_. doi:10.1056/NEJMoa0906089
<https://doi.org/10.1056/NEJMoa0906089>.
Distribution: weibull
Parameters:
  shape: 2.360
  scale: 3.180

R

# EpiNow2 currently accepts gamma or lognormal distributions;
# for other distribution families we can pass the PMF values

influenza_generation_discrete <-
  epiparameter::discretise(influenza_generation)

influenza_generation_max <-
  quantile(influenza_generation_discrete, p = 0.999)

influenza_generation_pmf <-
  density(
    influenza_generation_discrete,
    at = 1:influenza_generation_max
  )

influenza_generation_pmf

OUTPUT

[1] 0.063123364 0.221349877 0.297212205 0.238968280 0.124851641 0.043094538
[7] 0.009799363
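The discretisation step above can also be sketched with base R alone, as a simplified stand-in for epiparameter::discretise(): take differences of the Weibull CDF at integer cut points, then renormalise the truncated tail. Because the exact discretisation convention differs, the values may not match the epiparameter output exactly.

```r
# Weibull parameters reported in the Lessler et al. (2009) study
shape <- 2.36
scale <- 3.18

# Probability mass on days 1..7: differences of the CDF at integer cuts
max_day <- 7
pmf <- diff(pweibull(0:max_day, shape = shape, scale = scale))

# Renormalise so the truncated PMF sums to 1
pmf <- pmf / sum(pmf)
round(pmf, 3)
```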

R

# EpiNow2::dist_spec() can also accept the PMF values
generation_time_influenza <-
  dist_spec(
    pmf = influenza_generation_pmf
  )

# incubation period -------------------------------------------------------

# Read the incubation period
influenza_incubation <-
  epidist_db(
    disease = "influenza",
    epi_dist = "incubation",
    single_epidist = TRUE
  )

# Discretize incubation period
influenza_incubation_discrete <-
  epiparameter::discretise(influenza_incubation)

influenza_incubation_max <-
  quantile(influenza_incubation_discrete, p = 0.999)

influenza_incubation_pmf <-
  density(
    influenza_incubation_discrete,
    at = 1:influenza_incubation_max
  )

influenza_incubation_pmf

OUTPUT

[1] 0.057491512 0.166877052 0.224430917 0.215076318 0.161045462 0.097466092
[7] 0.048419279 0.019900259 0.006795222

R

# EpiNow2::dist_spec() can also accept the PMF values
incubation_time_influenza <-
  dist_spec(
    pmf = influenza_incubation_pmf
  )

# epinow ------------------------------------------------------------------

# Read data
influenza_cleaned <-
  outbreaks::influenza_england_1978_school %>%
  select(date, confirm = in_bed)

# Run epinow()
epinow_estimates_igi <- epinow(
  # cases
  reported_cases = influenza_cleaned,
  # delays
  generation_time = generation_time_opts(generation_time_influenza),
  delays = delay_opts(incubation_time_influenza)
)

plot(epinow_estimates_igi)

Next steps


How to get distribution parameters from statistical distributions?

How can we get the mean and standard deviation for EpiNow2::dist_spec() from a generation time that has only distribution parameters, but no summary statistics like mean or sd?

Look at the {epiparameter} vignette on parameter extraction and conversion and its use cases!
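As an illustration of what such a conversion involves for the lognormal case, here is a base-R sketch using the closed-form formulas (the meanlog and sdlog values are hypothetical; {epiparameter}'s conversion helpers implement these and other formulas for you):

```r
# Hypothetical lognormal distribution parameters
meanlog <- 1.386
sdlog <- 0.568

# Closed-form conversion to natural-scale summary statistics
natural_mean <- exp(meanlog + sdlog^2 / 2)
natural_sd <- sqrt(exp(sdlog^2) - 1) * exp(meanlog + sdlog^2 / 2)

c(mean = natural_mean, sd = natural_sd)
```

A quick Monte Carlo check with rlnorm() can confirm such a conversion.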

How to estimate delay distributions for Disease X?

Refer to this excellent tutorial on estimating the serial interval and incubation period of Disease X accounting for censoring using Bayesian inference with packages like rstan and coarseDataTools.

Then, after you get your estimated values, you can manually create your own <epidist> class objects with epiparameter::epidist()! Take a look at its reference guide on “Create an <epidist> object”!

Key Points

  • Use distribution functions with <epidist> objects to get summary statistics and informative parameters for public health interventions like the Window for contact tracing and Length of quarantine.
  • Use discretise() to convert continuous to discrete delay distributions.
  • Use {epiparameter} to get reporting delays required in transmissibility estimates.