Design Principles for {epichains}
An R package for simulating, handling, and analysing transmission chains in R
Source:vignettes/design-principles.Rmd
design-principles.Rmd
This vignette outlines the design decisions that have been taken during the development of the epichains R package, and provides some of the reasoning, and possible pros and cons of each decision.
The goal here is to make it easy to acquaint oneself with the code base in the absence of the current maintainer. This will ease future contributions and maintenance of the package.
Scope
{epichains} aims to facilitate:
- Simulation of transmission chains using branching processes methods,
- Handling of transmission chains using objected-orientated design, and
- Estimation of the likelihood of observing transmission chain
summaries (sizes or lengths). This can be achieved in two ways:
- Analytical/closed-form likelihoods that take the form
<distribution>_<chain_statistic_type>_ll>
, for example,gborel_size_ll()
andpois_length_ll()
. - Numerical likelihood simulation using
offspring_ll()
.
- Analytical/closed-form likelihoods that take the form
Additionally, the package provides mixture probability distributions
for generating offspring distributions, for example,
rborel()
.
Design decisions
Simulation functions
Simulation of branching processes are achieved through
simulate_chains()
and simulate_chain_stats()
.
For details of the underlying methods, see the theory
vignette.
The simulations are stochastic, meaning that one set of inputs can
produce varied results. The models here can also be use to explore
scenario analyses by varying the inputs. Often, in cases where there is
need for more than one run of the model and/or with more than one set of
parameter values, the inputs and outputs are stored in separate data
structures. However, this approach can be limiting when performing
scenario analyses, as the user has to manually manipulate the two
objects with reshaping and joining operations. This has the potential to
lead to errors and loss of information. Hence,
simulate_chains()
and simulate_chain_stats()
return objects of the dedicated classes <epichains>
and <epichains_summary>
respectively that store the
input parameters and the output of the simulation in a single
object.
The <epichains>
class extends the
<data.frame>
, using columns to store information
about the simulated transmission chains and the parameter values as
attributes. <data.frame>
was chosen because its
tabular structure allows us to store information in rows and columns,
and is a widely used data structure in R. Similarly, the
<epichains_summary>
class is a superclass (an
extension) of R’s <numeric>
class and stores the
parameter values as attributes.
The <epichains>
object contains information about
the whole outbreak, but key summaries are not easily deduced from a
quick glance of the object. Hence, the class has a dedicated
format()/print()
method to print the simulated transmission
chains in a manner similar to a <dataframe>
, but
accompanied by extra summary information including the number of chains
simulated, number of generations reached, and the number of infectors
created. These summaries are useful for quickly assessing the output of
the simulation.
Importantly, the <epichains>
class has a
summary()
method that returns an
<epichains_summary>
object. This is a design decision
that was taken to allow for easy coercion between an
<epichains>
object obtained from
simulate_chains()
and summaries of the
<epichains>
object otherwise attainable by a separate
run of simulate_chain_stats()
with the same parameter
values passed to simulate_chains()
.
Lastly, <epichains>
objects have an
aggregate()
method that aggregates the simulated outbreak
into cases by “generation” or “time”. This is syntactic sugar for the
dplyr::group_by()
and dplyr::summarise()
style
of aggregation with the added benefit of not taking on
dplyr
as a dependency to achieve the goal.
In summary, an <epichains>
object has the
following structure:
- Columns:
-
chain
(<integer>
) -
infector
(<integer>
) -
infectee
(<integer>
) -
generation
(<integer>
), and optionally, -
time
(<numeric>
), ifgeneration_time
is specified
-
- Attributes (See definitions in simulate_chains):
-
n_chains
, -
statistic
, -
stat_threshold
, -
offspring_dist
, and -
track_pop
(ifpop
is finite, i.e., if specified).
-
likelihood estimation
Likelihoods are estimated using the likelihood()
function. The function is designed to be flexible in two inputs:
- data can be supplied as a vector of chain summaries, a
<epichains>
object, or a<epichains_summary>
object, and - the offspring distribution can be supplied as a function, allowing the user to use a custom offspring distributions.
likelihood()
uses either analytical or numerical methods
to estimate the likelihood of observing the input chain summaries. The
analytical methods are closed-form likelihoods that take the form
.<offspring_dist>_<statistic>_ll()
, for
example, .gborel_size_ll()
and
.pois_length_ll()
and are shipped in this package. If the
user supplies an offspring distribution and a statistic for which a
solution exists, internally, it is used. If not, simulations are used to
estimate the likelihood. The numerical likelihood simulation is achieved
using .offspring_ll()
, an internal wrapper around
simulate_chain_stats()
.
The output type of likelihood()
depends on the
combination of the arguments individual
,
obs_prob
, and nsim_obs
as summarised in the
table below:
individual |
obs_prob |
Output type | Output length | Element length |
---|---|---|---|---|
FALSE |
1 | <numeric> |
1 | NA |
FALSE |
obs_prob >= 0 and obs_prob <= 1 |
<numeric> |
nsim_obs |
NA |
TRUE |
1 | <list> |
1 | input data |
TRUE |
obs_prob >= 0 and obs_prob <= 1 |
<list> |
nsim_obs |
input data |
Naming and documentation style
The package uses the following naming conventions:
Functions and arguments are named using snake_case, example,
simulate_chains()
.Internal functions are prefixed with a period, for example,
.offspring_ll()
. This is only a visual cue and does not have any technical implications.In the documentation:
Classes and objects are enclosed in angle brackets, for example,
<epichains>
.Packages are enclosed in curly braces, for example, epichains.
All function arguments are defined in sentence case and punctuated (especially with full stops).
Function titles are in imperative form.
Functions Are referred to with
function_name()
Dependencies
- Input validation:
-
checkmate: exports
many
test_*()
,check_*()
andassert_*()
functions and is available on CRAN.
-
checkmate: exports
many
Development journey
{epichains} is a successor to the bpmodels package, which was retired after the release of {epichains} v0.1.0.
{epichains} was born out of a need to refactor {bpmodels}, which led to a name change and subsequent changes in design that would have required significant disruptive changes to {bpmodels}. {epichains} is a major refactoring of {bpmodels} to provide a simulation function that accounts for susceptible depletion and population immunity without restrictions on the offspring distributions, better function and long form documentation, and an object-oriented approach to simulating and handling transmission chains in R.
Future plans include simulation of contacts of infectors, the incorporation of network effects, an object-oriented approach to estimating chain size and length likelihoods, and interoperability with the {epiparameter} package for ease of setting up various epidemiological delays.