Skip to contents

This vignette is intended to be guidance for working with probability distributions in R in the context of using delay distributions with cfr to obtain delay-corrected estimates of disease severity.

# load necessary packages
library(cfr)

# some distribution packages
library(distributional)
#> Error in get(paste0(generic, ".", class), envir = get_method_env()) : 
#>   object 'type_sum.accel' not found
library(distcrete)

A brief primer on distributions in R

R and its extension packages provide rich and extensive support for representing and working with probability distributions, and this can be seen from the CRAN probability distributions task view.

Users might already be familiar with some distributions and their related functionality — such as the probability density function, or random number generation — provided in the stats package which is loaded when R is started.

For example, the Gamma distribution’s probability density function (PDF) is represented by stats::dgamma().

# the probability density function at `x` for a Gamma distribution
dgamma(x = seq(10), shape = 5, rate = 1)
#>  [1] 0.01532831 0.09022352 0.16803136 0.19536681 0.17546737 0.13385262
#>  [7] 0.09122619 0.05725229 0.03373716 0.01891664

Using delay distribution densities in cfr

To correct for reporting delays in disease severity estimation, we are primarily interested in the PDF or PMF (probability mass function) of the distribution of reporting delays between cases and known outcomes.

We refer to the R functions providing both the PDF and PMF as the density of the distribution.

The delay distribution density must be passed to functions such as cfr_static(), cfr_time_varying(), or estimate_ascertainment() via the delay_density argument. This can be represented in pseudo-code as

cfr_*(data, delay_density = <DENSITY_FUNCTION>)

Preparing delay distribution density for cfr

Importantly, the cfr functions must receive the delay density in such a way that the density can be calculated over a flexible number of values (the sequence of days in the outbreak data).

For example, by wrapping the density function for a Gamma distribution within another function which fixes the distribution parameters and accepts a vector of numbers, it can be evaluated at any set of values specified by the vector.

# wrap stats::dgamma() in a function
# the Gamma distribution parameters are contained within dens_gamma()
dens_gamma <- function(x) {
  stats::dgamma(x = x, shape = 5, scale = 1)
}

# check over a vector of `x`
dens_gamma(x = seq(10))
#>  [1] 0.01532831 0.09022352 0.16803136 0.19536681 0.17546737 0.13385262
#>  [7] 0.09122619 0.05725229 0.03373716 0.01891664

More information on working with functions, and especially anonymous functions, can be found in the chapter on Functional Programming in Advanced R. Users working with R > 4.1.0 can also use the new syntax for anonymous functions, e.g. \(x) stats::dgamma(x, shape, scale).

Note that we use the central estimates for each distribution parameter, and by ignoring uncertainty in these parameters the uncertainty in the resulting CFR is likely to be underestimated.

Passing delay distribution density to cfr functions

delay distribution density functions can be passed, as anonymous functions, to cfr functions as shown with the example data provided with the package Camacho et al. (2014). Parameters for the onset-to-death distribution of Ebola virus disease are taken from Barry et al. (2018).

# load package data
data("ebola1976")

# pass function wrapping dgamma to cfr_static()
cfr_static(
  data = ebola1976,
  delay_density = function(x) dgamma(x, shape = 2.40, scale = 3.33)
)
#>   severity_estimate severity_low severity_high
#> 1            0.9592       0.9295        0.9793

Using other distribution representations

While many R packages provide support for representing probability distributions, we focus on two examples, distributional and distcrete, to show how closures wrapping these could be implemented, while wrapping the parameters from Barry et al. (2018).

Users may wish to use these or similar packages for better management of distributions and parameters. See the CRAN probability distributions task view for more information on distribution packages suitable for different use cases.

Using distributional

Note that the output of density(<distribution>, x) is a list containing a vector.

# using {distributional} and parameters from Barry et al. 2018
dist_onset_to_death_ebola <- dist_gamma(shape = 2.40, rate = 1.0 / 3.33)

# wrap function and pass it to cfr_static()
# unlist() required as density(<distribution>, x) is a list
cfr_static(
  data = ebola1976,
  delay_density = function(x) unlist(density(dist_onset_to_death_ebola, x))
)
#>   severity_estimate severity_low severity_high
#> 1            0.9592       0.9295        0.9793

Using distcrete

The distcrete package provides support for discrete distributions. Here, we show an example for the discrete Gamma distribution. Note that the density function for <distcrete> objects is encapsulated, and can be passed directly to the delay_density argument.

# using {distcrete} and parameters from Barry et al. 2018
dist_onset_to_death_ebola <- distcrete(
  name = "gamma", shape = 2.40, scale = 3.33, interval = 1
)

# pass density function to cfr_static()
cfr_static(
  data = ebola1976,
  delay_density = dist_onset_to_death_ebola$d
)
#>   severity_estimate severity_low severity_high
#> 1            0.9576       0.9275        0.9782

Using continuous and discrete distributions

Note that discrete distributions are the more appropriate choice to be passed to cfr_*(), as we are usually working with daily case and death data.

We do use continuous distributions in many examples as onset-to-death delays are typically long with large variance. Evaluating the probability distribution function of such distributions at discrete points (here, days) is similar to evaluating the probability mass function of the equivalent discrete distribution.

However, note that this assumption may not be appropriate for more strongly peaked distributions, i.e., where onset-to-death is strongly peaked with a low variance, as the difference between the PDF and PMF is likely to be larger on average (than for a more spread out distribution).

Further, cfr functions tally estimated death counts (calculated by convolving cases and densities), so that any underestimates due to the PDF-PMF discrepancy at one end of the distribution help to cancel out overestimates at the end of the distribution.

While users can pass functions from stats, and can manage distribution parameters using specialised packages and classes, it may be convenient to be able to access parameters reported in the epidemiological literature from a curated library.

The forthcoming epiparameter package aims to be such a library of epidemiological delay distributions. The dedicated <epiparameter> class is expected to have similar functionality to other distribution classes, allowing easy definition of density functions that can be passed to cfr.

The pseudo-code below shows how this might work.

# NOTE: this is pseudo-code
EPIPARAMETER_OBJECT <- ACCESS_DISTRIBUTION(disease, study)

cfr_*(data = data, delay_density = function(x) density(<EPIPARAMETER_OBJECT>, x))

References

Barry, Ahmadou, Steve Ahuka-Mundeke, Yahaya Ali Ahmed, Yokouide Allarangar, Julienne Anoko, Brett Nicholas Archer, Aaron Aruna Abedi, et al. 2018. “Outbreak of Ebola virus disease in the Democratic Republic of the Congo, April–May, 2018: an epidemiological study.” The Lancet 392 (10143): 213–21. https://doi.org/10.1016/S0140-6736(18)31387-4.
Camacho, A., A. J. Kucharski, S. Funk, J. Breman, P. Piot, and W. J. Edmunds. 2014. “Potential for Large Outbreaks of Ebola Virus Disease.” Epidemics 9 (December): 70–78. https://doi.org/10.1016/j.epidem.2014.09.003.