Content from Read case data
Overview
Questions
- Where do you usually store your outbreak data?
- How many different data formats can I read?
- Is it possible to import data from databases and health APIs?
Objectives
- Explain how to import outbreak data from different sources into the R environment.
Prerequisites
This episode requires you to be familiar with:
Data science: Basic programming with R.
Introduction
The initial step in outbreak analysis involves importing the target dataset into the R environment from various sources. Outbreak data is typically stored in files of diverse formats, relational database management systems (RDBMS), or health information system (HIS) application programming interfaces (APIs) such as REDCap and DHIS2. The latter option is particularly well-suited for storing institutional health data. This episode will elucidate the process of reading cases from these sources.
Let’s start by loading the package rio to read data and the package here to easily find a file path within your RStudio project. We’ll use the pipe %>% to connect some of their functions, including others from the package dplyr, so let’s also load the tidyverse package:
R
# Load packages
library(tidyverse) # for {dplyr} functions and the pipe %>%
library(rio) # for importing data
library(here) # for easy file referencing
The double-colon
The double-colon :: in R lets you call a specific function from a package without loading the entire package into the current environment.
For example, dplyr::filter(data, condition) uses filter() from the dplyr package.
This helps us remember package functions and avoid namespace conflicts.
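As a minimal illustration using a built-in dataset (mtcars ships with base R), the call below runs one dplyr function without attaching the whole package:
R
# Run one {dplyr} function without library(dplyr)
dplyr::filter(mtcars, mpg > 30)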
Setup a project and folder
- Create an RStudio project. If needed, follow this how-to guide on “Hello RStudio Projects” to create one.
- Inside the RStudio project, create the data/ folder.
- Inside the data/ folder, save the ebola_cases_2.csv and marburg.zip files.
Reading from files
Several packages are available for importing outbreak data stored in individual files into R. These include rio, readr from the tidyverse, io, ImportExport, and data.table. Together, these packages offer methods to read single or multiple files in a wide range of formats.
The below example shows how to import a csv file into the R environment using the rio package.
R
# read data
# e.g., the path to our file is data/ebola_cases_2.csv then:
ebola_confirmed <- rio::import(
here::here("data", "ebola_cases_2.csv")
) %>%
dplyr::as_tibble() # for a simple data frame output
# preview data
ebola_confirmed
OUTPUT
# A tibble: 120 × 4
year month day confirm
<int> <int> <int> <int>
1 2014 5 18 1
2 2014 5 20 2
3 2014 5 21 4
4 2014 5 22 6
5 2014 5 23 1
6 2014 5 24 2
7 2014 5 26 10
8 2014 5 27 8
9 2014 5 28 2
10 2014 5 29 12
# ℹ 110 more rows
Similarly, you can import files of other formats such as tsv, xlsx, etc.
Why should we use the {here} package?
The here package is designed to simplify file referencing in R projects by providing a reliable way to construct file paths relative to the project root. The main reason to use it is Cross-Environment Compatibility.
It works across different operating systems (Windows, Mac, Linux) without needing to adjust file paths.
- On Windows, paths are written using backslashes (\) as the separator between folder names: "data\raw-data\file.csv"
- On Unix-based operating systems such as macOS or Linux, the forward slash (/) is used as the path separator: "data/raw-data/file.csv"
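For instance, the sketch below builds the same relative path on any operating system; the absolute paths in the comments are illustrative and will differ on your machine:
R
# here::here() builds an OS-appropriate path from the project root
here::here("data", "raw-data", "file.csv")
# e.g. "C:/Users/you/my-project/data/raw-data/file.csv" on Windows
# or "/home/you/my-project/data/raw-data/file.csv" on macOS/Linux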
The here package is ideal for adding one more layer of reproducibility to your work. If you are interested in reproducibility, we invite you to read this tutorial to increase the openness, sustainability, and reproducibility of your epidemic analysis with R.
Reading compressed data
Take 1 minute: Can you read data from a compressed file in R? Download this zip file containing data for the Marburg outbreak and then import it to your working environment.
You can check the full list of supported file formats in the rio package on the package website. To expand {rio} to the full range of supported import and export formats, run:
R
rio::install_formats()
You can use this template to read the file:
rio::import(here::here("some", "where", "downto", "path", "file_name.zip"))
R
rio::import(here::here("data", "Marburg.zip"))
Reading from databases
The DBI package serves as a versatile interface for interacting with database management systems (DBMS) across different back-ends or servers. It offers a uniform method for accessing and retrieving data from various database systems.
When to read directly from a database?
We can use database interface packages to optimize memory usage. If we process the database with “queries” (e.g., select, filter, summarise) before extraction, we can reduce the memory load in our RStudio session. Conversely, conducting all data manipulation outside the database management system can lead to occupying more memory than desired, or even running out of it.
The following code chunk demonstrates in four steps how to create a temporary SQLite database in memory, store the ebola_confirmed data frame as a table on it, and subsequently read it:
1. Connect with a database
First, we establish a connection to an SQLite database created in
memory using DBI::dbConnect()
.
R
library(DBI)
library(RSQLite)
# Create a temporary SQLite database in memory
db_connection <- DBI::dbConnect(
drv = RSQLite::SQLite(),
dbname = ":memory:"
)
Callout
A real-life connection would look like this:
R
# in real-life
db_connection <- DBI::dbConnect(
RSQLite::SQLite(),
host = "database.epiversetrace.com",
user = "juanito",
password = epiversetrace::askForPassword("Database password")
)
2. Write a local data frame as a table in a database
Then, we can write the ebola_confirmed data frame into a table named cases within the database using the DBI::dbWriteTable() function.
R
# Store the 'ebola_confirmed' dataframe as a table named 'cases'
# in the SQLite database
DBI::dbWriteTable(
conn = db_connection,
name = "cases",
value = ebola_confirmed
)
In a database framework, you can have more than one table. Each table can belong to a specific entity (e.g., patients, care units, jobs). All tables will be related by a common ID or primary key.
3. Read data from a table in a database
Subsequently, we read the data from the cases table using dplyr::tbl().
R
# Read one table from the database
mytable_db <- dplyr::tbl(src = db_connection, "cases")
If we apply dplyr verbs to this SQLite database table, these verbs will be translated into SQL queries.
R
# Show the SQL queries translated
mytable_db %>%
dplyr::filter(confirm > 50) %>%
dplyr::arrange(desc(confirm)) %>%
dplyr::show_query()
OUTPUT
<SQL>
SELECT `cases`.*
FROM `cases`
WHERE (`confirm` > 50.0)
ORDER BY `confirm` DESC
4. Extract data from the database
Use dplyr::collect() to force computation of a database query and extract the output to your local computer.
R
# Pull all data down to a local tibble
extracted_data <- mytable_db %>%
dplyr::filter(confirm > 50) %>%
dplyr::arrange(desc(confirm)) %>%
dplyr::collect()
The extracted_data object contains the extracted data, ideally after specifying queries that reduce its size.
R
# View the extracted_data
extracted_data
OUTPUT
# A tibble: 3 × 4
year month day confirm
<int> <int> <int> <int>
1 2014 9 16 84
2 2014 9 15 68
3 2014 9 17 56
Run SQL queries in R using dbplyr
Practice how to make relational database SQL queries using multiple dplyr verbs like dplyr::left_join() among tables before pulling down data to your local session with dplyr::collect()!
You can also review the dbplyr R package. But for a step-by-step tutorial about SQL, we recommend this tutorial about data management with SQL for Ecologists. You will find its syntax close to that of dplyr!
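As a minimal sketch of this idea, reusing the in-memory connection from above (the thresholds table and its columns are invented purely for illustration):
R
# Write a second, illustrative table to the same database
DBI::dbWriteTable(
  conn = db_connection,
  name = "thresholds",
  value = data.frame(year = 2014, alert_level = "high")
)
# Join both tables inside the database, then pull the result locally
dplyr::tbl(src = db_connection, "cases") %>%
  dplyr::left_join(dplyr::tbl(src = db_connection, "thresholds"), by = "year") %>%
  dplyr::collect()
When you finish working with a database, it is good practice to close the connection with DBI::dbDisconnect(db_connection).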
Reading from HIS APIs
Health-related data are also increasingly stored in specialized HIS APIs like Fingertips, GoData, REDCap, and DHIS2. In such cases, one can resort to the readepi package, which enables reading data from HIS APIs.
-[TBC]
Key Points
- Use {rio}, {io}, {readr} and {ImportExport} to read data from individual files.
- Use {readepi} to read data from HIS APIs and RDBMS.
Content from Clean case data
Overview
Questions
- How to clean and standardize case data?
Objectives
- Explain how to clean, curate, and standardize case data using the cleanepi package
- Perform essential data-cleaning operations on a raw case dataset.
Prerequisite
This episode requires you to:
- Download the simulated_ebola_2.csv file.
- Save it in the data/ folder.
Introduction
In the process of analyzing outbreak data, it’s essential to ensure that the dataset is clean, curated, standardized, and valid to facilitate accurate and reproducible analysis. This episode focuses on cleaning epidemic and outbreak data using the cleanepi package. For demonstration purposes, we’ll work with a simulated dataset of Ebola cases.
Let’s start by loading the package rio to read data and the package cleanepi to clean it. We’ll use the pipe %>% to connect some of their functions, including others from the package dplyr, so let’s also load the tidyverse package:
R
# Load packages
library(tidyverse) # for {dplyr} functions and the pipe %>%
library(rio) # for importing data
library(here) # for easy file referencing
library(cleanepi)
The double-colon
The double-colon :: in R lets you call a specific function from a package without loading the entire package into the current environment.
For example, dplyr::filter(data, condition) uses filter() from the dplyr package.
This helps us remember package functions and avoid namespace conflicts.
The first step is to import the dataset into the working environment, which can be done by following the guidelines outlined in the Read case data episode. This involves loading the dataset into the R environment and viewing its structure and content.
R
# Read data
# e.g.: if path to file is data/simulated_ebola_2.csv then:
raw_ebola_data <- rio::import(
here::here("data", "simulated_ebola_2.csv")
) %>%
dplyr::as_tibble() # for a simple data frame output
R
# Print data frame
raw_ebola_data
OUTPUT
# A tibble: 15,003 × 9
V1 `case id` age gender status `date onset` `date sample` lab region
<int> <int> <chr> <chr> <chr> <chr> <chr> <lgl> <chr>
1 1 14905 90 1 "conf… 03/15/2015 06/04/2015 NA valdr…
2 2 13043 twenty… 2 "" Sep /11/13 03/01/2014 NA valdr…
3 3 14364 54 f <NA> 09/02/2014 03/03/2015 NA valdr…
4 4 14675 ninety <NA> "" 10/19/2014 31/ 12 /14 NA valdr…
5 5 12648 74 F "" 08/06/2014 10/10/2016 NA valdr…
6 5 12648 74 F "" 08/06/2014 10/10/2016 NA valdr…
7 6 14274 sevent… female "" Apr /05/15 01/23/2016 NA valdr…
8 7 14132 sixteen male "conf… Dec /29/Y 05/10/2015 NA valdr…
9 8 14715 44 f "conf… Apr /06/Y 04/24/2016 NA valdr…
10 9 13435 26 1 "" 09/07/2014 20/ 09 /14 NA valdr…
# ℹ 14,993 more rows
Discussion
Let’s diagnose the data frame. List all the characteristics in the data frame above that are problematic for data analysis.
Are any of those characteristics familiar from any previous data analysis you have performed?
A quick inspection
Quick exploration and inspection of the dataset are crucial before diving into any analysis tasks. The cleanepi package simplifies this process with the scan_data() function. Let’s take a look at how you can use it:
R
cleanepi::scan_data(raw_ebola_data)
OUTPUT
Field_names missing numeric date character logical
1 age 0.0646 0.8348 0.0000 0.1006 0
2 gender 0.1578 0.0472 0.0000 0.7950 0
3 status 0.0535 0.0000 0.0000 0.9465 0
4 date onset 0.0001 0.0000 0.9159 0.0840 0
5 date sample 0.0001 0.0000 0.9999 0.0000 0
6 region 0.0000 0.0000 0.0000 1.0000 0
The results provide an overview of the content of every column, including column names and the proportion of some data types per column. You can see that the column names in the dataset are descriptive but lack consistency, as some are composed of multiple words separated by white spaces. Additionally, some columns contain more than one data type, and there are missing values in others.
Common operations
This section demonstrates how to perform some common data cleaning operations using the cleanepi package.
Standardizing column names
For this example dataset, standardizing column names typically involves removing spaces and connecting different words with “_”. This practice helps maintain consistency and readability in the dataset. However, the function used for standardizing column names offers more options. Type ?cleanepi::standardize_column_names for more details.
R
sim_ebola_data <- cleanepi::standardize_column_names(raw_ebola_data)
names(sim_ebola_data)
OUTPUT
[1] "v1" "case_id" "age" "gender" "status"
[6] "date_onset" "date_sample" "lab" "region"
If you want to maintain certain column names without subjecting them to the standardization process, you can utilize the keep argument of the function cleanepi::standardize_column_names(). This argument accepts a vector of column names that are intended to be kept unchanged.
Challenge
What differences can you observe in the column names?
Standardize the column names of the input dataset, but keep the first column name as it is.
You can try
cleanepi::standardize_column_names(data = raw_ebola_data, keep = "V1")
Removing irregularities
Raw data may contain irregularities such as duplicated rows, empty rows and columns, or constant columns (where all entries have the same value). Functions from cleanepi like remove_duplicates() and remove_constants() remove such irregularities, as demonstrated in the below code chunk.
R
# Remove constants
sim_ebola_data <- cleanepi::remove_constants(sim_ebola_data)
Now, print the output to identify what constant column you removed!
R
# Remove duplicates
sim_ebola_data <- cleanepi::remove_duplicates(sim_ebola_data)
OUTPUT
Found 5 duplicated rows in the dataset. Please consult the report for more details.
You can get the number and location of the duplicated rows that were found. Run cleanepi::print_report(), wait for the report to open in your browser, and find the “Duplicates” tab.
R
# Print a report
cleanepi::print_report(sim_ebola_data)
Challenge
In the following data frame:
OUTPUT
# A tibble: 6 × 5
col1 col2 col3 col4 col5
<dbl> <dbl> <chr> <chr> <date>
1 1 1 a b NA
2 2 3 a b NA
3 NA NA a <NA> NA
4 NA NA a <NA> NA
5 NA NA a <NA> NA
6 NA NA <NA> <NA> NA
What columns or rows are:
- duplicates?
- empty?
- constant?
Duplicates mostly refer to replicated rows. Empty rows or columns can be a subset within the set of constant rows or columns.
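One way to explore the answer is to rebuild the toy data frame and let cleanepi report what it removes (a sketch, assuming the column types shown in the printout above):
R
# Rebuild the toy data frame from the printout above
df <- tibble::tibble(
  col1 = c(1, 2, NA, NA, NA, NA),
  col2 = c(1, 3, NA, NA, NA, NA),
  col3 = c("a", "a", "a", "a", "a", NA),
  col4 = c("b", "b", NA, NA, NA, NA),
  col5 = as.Date(rep(NA_character_, 6))
)
# Drop constant (including empty) rows and columns, then duplicated rows
df %>%
  cleanepi::remove_constants() %>%
  cleanepi::remove_duplicates()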
Replacing missing values
In addition to the irregularities, raw data can contain missing values that may be encoded by different strings, including the empty string. To ensure robust analysis, it is good practice to replace all missing values by NA in the entire dataset. Below is a code snippet demonstrating how you can achieve this in cleanepi:
R
sim_ebola_data <- cleanepi::replace_missing_values(
data = sim_ebola_data,
na_strings = ""
)
sim_ebola_data
OUTPUT
# A tibble: 15,000 × 8
v1 case_id age gender status date_onset date_sample row_id
<int> <int> <chr> <chr> <chr> <chr> <chr> <int>
1 1 14905 90 1 confirmed 03/15/2015 06/04/2015 1
2 2 13043 twenty-five 2 <NA> Sep /11/13 03/01/2014 2
3 3 14364 54 f <NA> 09/02/2014 03/03/2015 3
4 4 14675 ninety <NA> <NA> 10/19/2014 31/ 12 /14 4
5 5 12648 74 F <NA> 08/06/2014 10/10/2016 5
6 6 14274 seventy-six female <NA> Apr /05/15 01/23/2016 7
7 7 14132 sixteen male confirmed Dec /29/Y 05/10/2015 8
8 8 14715 44 f confirmed Apr /06/Y 04/24/2016 9
9 9 13435 26 1 <NA> 09/07/2014 20/ 09 /14 10
10 10 14816 thirty f <NA> 06/29/2015 06/02/2015 11
# ℹ 14,990 more rows
Validating subject IDs
Each entry in the dataset represents a subject and should be distinguishable by a specific column formatted in a particular way, such as falling within a specified range, containing certain prefixes and/or suffixes, or containing a specific number of characters. The cleanepi package offers the function check_subject_ids(), designed precisely for this task, as shown in the below code chunk. This function validates whether the IDs are unique and meet the required criteria.
R
sim_ebola_data <-
cleanepi::check_subject_ids(
data = sim_ebola_data,
target_columns = "case_id",
range = c(0, 15000)
)
OUTPUT
Found 1957 duplicated rows in the subject IDs. Please consult the report for more details.
Note that our simulated dataset does contain duplicated subject IDs.
Let’s print a preliminary report with cleanepi::print_report(sim_ebola_data). Focus on the “Unexpected subject ids” tab to identify which IDs require extra treatment.
After finishing this tutorial, we invite you to explore the package reference guide of cleanepi to find the function that can fix this situation.
Standardizing dates
Certainly, an epidemic dataset contains date columns for different events, such as the date of infection, date of symptoms onset, etc. These dates can come in different date formats, and it is good practice to standardize them. The cleanepi package provides functionality for converting date columns of epidemic datasets into ISO format, ensuring consistency across the different date columns. Here’s how you can use it on our simulated dataset:
R
sim_ebola_data <- cleanepi::standardize_dates(
sim_ebola_data,
target_columns = c(
"date_onset",
"date_sample"
)
)
sim_ebola_data
OUTPUT
# A tibble: 15,000 × 8
v1 case_id age gender status date_onset date_sample row_id
<int> <chr> <chr> <chr> <chr> <date> <date> <int>
1 1 14905 90 1 confirmed 2015-03-15 2015-06-04 1
2 2 13043 twenty-five 2 <NA> 2013-09-11 2014-03-01 2
3 3 14364 54 f <NA> 2014-09-02 2015-03-03 3
4 4 14675 ninety <NA> <NA> 2014-10-19 2031-12-14 4
5 5 12648 74 F <NA> 2014-08-06 2016-10-10 5
6 6 14274 seventy-six female <NA> 2015-04-05 2016-01-23 7
7 7 14132 sixteen male confirmed NA 2015-05-10 8
8 8 14715 44 f confirmed NA 2016-04-24 9
9 9 13435 26 1 <NA> 2014-09-07 2020-09-14 10
10 10 14816 thirty f <NA> 2015-06-29 2015-06-02 11
# ℹ 14,990 more rows
This function converts the values in the target columns, or will automatically figure out the date columns within the dataset (if target_columns = NULL) and convert them into the Ymd format.
How is this possible?
We invite you to find the key package that works internally by reading the Details section of the Standardize date variables reference manual!
Converting to numeric values
In the raw dataset, some columns can come with a mixture of character and numeric values, and you may want to convert the character values explicitly into numeric ones. For example, in our simulated dataset, some entries in the age column are written in words. In cleanepi, the function convert_to_numeric() does such a conversion, as illustrated in the below code chunk.
R
sim_ebola_data <- cleanepi::convert_to_numeric(sim_ebola_data,
target_columns = "age"
)
sim_ebola_data
OUTPUT
# A tibble: 15,000 × 8
v1 case_id age gender status date_onset date_sample row_id
<int> <chr> <dbl> <chr> <chr> <date> <date> <int>
1 1 14905 90 1 confirmed 2015-03-15 2015-06-04 1
2 2 13043 25 2 <NA> 2013-09-11 2014-03-01 2
3 3 14364 54 f <NA> 2014-09-02 2015-03-03 3
4 4 14675 90 <NA> <NA> 2014-10-19 2031-12-14 4
5 5 12648 74 F <NA> 2014-08-06 2016-10-10 5
6 6 14274 76 female <NA> 2015-04-05 2016-01-23 7
7 7 14132 16 male confirmed NA 2015-05-10 8
8 8 14715 44 f confirmed NA 2016-04-24 9
9 9 13435 26 1 <NA> 2014-09-07 2020-09-14 10
10 10 14816 30 f <NA> 2015-06-29 2015-06-02 11
# ℹ 14,990 more rows
Multiple language support
Thanks to the numberize package, we can convert numbers written as English, French or Spanish words to positive integer values!
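For a quick sketch of the idea (the argument names below follow our reading of the numberize documentation, so treat them as an assumption):
R
# Convert a number word to an integer (lang argument assumed from {numberize} docs)
numberize::numberize("twenty-five", lang = "en")
# expected: 25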
Epidemiology related operations
In addition to common data cleansing tasks, such as those discussed in the above section, the cleanepi package offers additional functionalities tailored specifically for processing and analyzing outbreak and epidemic data. This section covers some of these specialized tasks.
Checking sequence of dated events
Ensuring the correct order and sequence of dated events is crucial in epidemiological data analysis, especially when analyzing infectious diseases where the timing of events like symptom onset and sample collection is essential. The cleanepi package provides a helpful function called check_date_sequence() precisely for this purpose.
Here’s an example code chunk demonstrating the usage of the function check_date_sequence() in our simulated Ebola dataset:
R
sim_ebola_data <- cleanepi::check_date_sequence(
data = sim_ebola_data,
target_columns = c("date_onset", "date_sample")
)
This functionality is crucial for ensuring data integrity and accuracy in epidemiological analyses, as it helps identify any inconsistencies or errors in the chronological order of events, allowing you to address them appropriately.
Let’s print another preliminary report with cleanepi::print_report(sim_ebola_data). Focus on the “Incorrect date sequence” tab to identify which IDs had this issue.
Dictionary-based substitution
In the realm of data pre-processing, it’s common to encounter scenarios where certain columns in a dataset, such as the “gender” column in our simulated Ebola dataset, are expected to have specific values or factors. However, it’s also common for unexpected or erroneous values to appear in these columns, which need to be replaced with appropriate values. The cleanepi package offers support for dictionary-based substitution, a method that allows you to replace values in specific columns based on mappings defined in a dictionary. This approach ensures consistency and accuracy in data cleaning.
Moreover, cleanepi provides a built-in dictionary specifically tailored for epidemiological data. The example dictionary below includes mappings for the “gender” column.
R
test_dict <- base::readRDS(
system.file("extdata", "test_dict.RDS", package = "cleanepi")
) %>%
dplyr::as_tibble() # for a simple data frame output
test_dict
OUTPUT
# A tibble: 6 × 4
options values grp orders
<chr> <chr> <chr> <int>
1 1 male gender 1
2 2 female gender 2
3 M male gender 3
4 F female gender 4
5 m male gender 5
6 f female gender 6
Now, we can use this dictionary to standardize values of the “gender” column according to predefined categories. Below is an example code chunk demonstrating how to utilize this functionality:
R
sim_ebola_data <- cleanepi::clean_using_dictionary(
sim_ebola_data,
dictionary = test_dict
)
sim_ebola_data
OUTPUT
# A tibble: 15,000 × 8
v1 case_id age gender status date_onset date_sample row_id
<int> <chr> <dbl> <chr> <chr> <date> <date> <int>
1 1 14905 90 male confirmed 2015-03-15 2015-06-04 1
2 2 13043 25 female <NA> 2013-09-11 2014-03-01 2
3 3 14364 54 female <NA> 2014-09-02 2015-03-03 3
4 4 14675 90 <NA> <NA> 2014-10-19 2031-12-14 4
5 5 12648 74 female <NA> 2014-08-06 2016-10-10 5
6 6 14274 76 female <NA> 2015-04-05 2016-01-23 7
7 7 14132 16 male confirmed NA 2015-05-10 8
8 8 14715 44 female confirmed NA 2016-04-24 9
9 9 13435 26 male <NA> 2014-09-07 2020-09-14 10
10 10 14816 30 female <NA> 2015-06-29 2015-06-02 11
# ℹ 14,990 more rows
This approach simplifies the data cleaning process, ensuring that categorical data in epidemiological datasets is accurately categorized and ready for further analysis.
Note that, when the column in the dataset contains values that are not in the dictionary, the function cleanepi::clean_using_dictionary() will raise an error.
You can start a custom dictionary with a data frame inside or outside R. You can use the function cleanepi::add_to_dictionary() to include new elements in the dictionary. For example:
R
new_dictionary <- tibble::tibble(
options = "0",
values = "female",
grp = "sex",
orders = 1L
) %>%
cleanepi::add_to_dictionary(
option = "1",
value = "male",
grp = "sex",
order = NULL
)
new_dictionary
OUTPUT
# A tibble: 2 × 4
options values grp orders
<chr> <chr> <chr> <int>
1 0 female sex 1
2 1 male sex 2
You can read more details in the section about “Dictionary-based data substituting” in the package “Get started” vignette.
Calculating time span between different date events
In epidemiological data analysis, it is also useful to track and analyze time-dependent events, such as the progression of a disease outbreak (i.e., the time difference between today and the first reported case) or the duration between sample collection and analysis (i.e., the time difference between the sample collection date and today). The most common example is to calculate the age of all the subjects given their date of birth (i.e., the time difference between today and the date of birth).
The cleanepi package offers a convenient function for calculating the time elapsed between two dated events at different time scales. For example, the below code snippet utilizes the function cleanepi::timespan() to compute the time elapsed since the date of sample collection for each case until the date this document was generated (2024-11-14).
R
sim_ebola_data <- cleanepi::timespan(
sim_ebola_data,
target_column = "date_sample",
end_date = Sys.Date(),
span_unit = "years",
span_column_name = "years_since_collection",
span_remainder_unit = "months"
)
sim_ebola_data %>%
dplyr::select(case_id, date_sample, years_since_collection, remainder_months)
OUTPUT
# A tibble: 15,000 × 4
case_id date_sample years_since_collection remainder_months
<chr> <date> <dbl> <dbl>
1 14905 2015-06-04 9 5
2 13043 2014-03-01 10 8
3 14364 2015-03-03 9 8
4 14675 2031-12-14 -7 0
5 12648 2016-10-10 8 1
6 14274 2016-01-23 8 9
7 14132 2015-05-10 9 6
8 14715 2016-04-24 8 6
9 13435 2020-09-14 4 2
10 14816 2015-06-02 9 5
# ℹ 14,990 more rows
After executing the function cleanepi::timespan(), two new columns named years_since_collection and remainder_months are added to the sim_ebola_data dataset, containing the calculated time elapsed since the date of sample collection for each case, measured in years, with the remaining time measured in months.
Challenge
Age data is useful in any downstream analysis. You can categorize it to generate stratified estimates.
Read the test_df.RDS data frame within the cleanepi package:
R
dat <- readRDS(
file = system.file("extdata", "test_df.RDS", package = "cleanepi")
) %>%
dplyr::as_tibble()
Calculate the age in years of the subjects with date of birth, and the remainder time in months. Clean and standardize the required elements to get this done.
Before calculating the age, you may need to:
- standardize column names
- standardize date columns
- replace missing values encoded as strings with a valid missing entry
In the solution we add date_first_pcr_positive_test, given that it will provide the temporal scale for descriptive and statistical downstream analysis of the disease outbreak.
R
dat_clean <- dat %>%
# standardize column names and dates
cleanepi::standardize_column_names() %>%
cleanepi::standardize_dates(
target_columns = c("date_of_birth", "date_first_pcr_positive_test")
) %>%
# replace from strings to a valid missing entry
cleanepi::replace_missing_values(
target_columns = "sex",
na_strings = "-99"
) %>%
# calculate the age in 'years' and return the remainder in 'months'
cleanepi::timespan(
target_column = "date_of_birth",
end_date = Sys.Date(),
span_unit = "years",
span_column_name = "age_in_years",
span_remainder_unit = "months"
)
Now, how would you categorize a numerical variable?
The simplest alternative is using Hmisc::cut2(). You can also use dplyr::case_when(); however, this requires more lines of code and is more appropriate for custom categorizations. Here we provide one solution using base::cut():
R
dat_clean %>%
# select to conveniently view timespan output
dplyr::select(
study_id,
sex,
date_first_pcr_positive_test,
date_of_birth,
age_in_years
) %>%
# categorize the age numerical variable
dplyr::mutate(
age_category = base::cut(
x = age_in_years,
breaks = c(0, 20, 35, 60, Inf), # replace with max value if known
include.lowest = TRUE,
right = FALSE
)
)
OUTPUT
# A tibble: 10 × 6
study_id sex date_first_pcr_posit…¹ date_of_birth age_in_years age_category
<chr> <int> <date> <date> <dbl> <fct>
1 PS001P2 1 2020-12-01 1972-06-01 52 [35,60)
2 PS002P2 1 2021-01-01 1952-02-20 72 [60,Inf]
3 PS004P2… NA 2021-02-11 1961-06-15 63 [60,Inf]
4 PS003P2 1 2021-02-01 1947-11-11 77 [60,Inf]
5 P0005P2 2 2021-02-16 2000-09-26 24 [20,35)
6 PS006P2 2 2021-05-02 NA NA <NA>
7 PB500P2 1 2021-02-19 1989-11-03 35 [35,60)
8 PS008P2 2 2021-09-20 1976-10-05 48 [35,60)
9 PS010P2 1 2021-02-26 1991-09-23 33 [20,35)
10 PS011P2 2 2021-03-03 1991-02-08 33 [20,35)
# ℹ abbreviated name: ¹date_first_pcr_positive_test
You can investigate the maximum values of variables using skimr::skim(). Instead of base::cut() you can also use Hmisc::cut2(x = age_in_years, cuts = c(20, 35, 60)), which calculates the maximum value itself and does not require more arguments.
Multiple operations at once
Performing data cleaning operations individually can be time-consuming and error-prone. The cleanepi package simplifies this process by offering a convenient wrapper function called clean_data(), which allows you to perform multiple operations at once. The clean_data() function applies a series of predefined data cleaning operations to the input dataset.
Alternatively, you can combine multiple data cleaning tasks via the pipe operator %>%, as shown in the below code snippet.
R
# Perform the cleaning operations using the pipe (%>%) operator
cleaned_data <- raw_ebola_data %>%
cleanepi::standardize_column_names() %>%
cleanepi::remove_constants() %>%
cleanepi::remove_duplicates() %>%
cleanepi::replace_missing_values(na_strings = "") %>%
cleanepi::check_subject_ids(
target_columns = "case_id",
range = c(1, 15000)
) %>%
cleanepi::standardize_dates(
target_columns = c("date_onset", "date_sample")
) %>%
cleanepi::convert_to_numeric(target_columns = "age") %>%
cleanepi::check_date_sequence(
target_columns = c("date_onset", "date_sample")
) %>%
cleanepi::clean_using_dictionary(dictionary = test_dict) %>%
cleanepi::timespan(
target_column = "date_sample",
end_date = Sys.Date(),
span_unit = "years",
span_column_name = "years_since_collection",
span_remainder_unit = "months"
)
Cleaning report
The cleanepi package generates a comprehensive report detailing the findings and actions of all data cleansing operations conducted during the analysis. This report is presented as a webpage with multiple sections. Each section corresponds to a specific data cleansing operation, and clicking on each section allows you to access the results of that particular operation. This interactive approach enables users to efficiently review and analyze the outcomes of individual cleansing steps within the broader data cleansing process.
You can view the report using the function cleanepi::print_report(cleaned_data).
Content from Validate case data
Overview
Questions
- How to convert a raw dataset into a linelist object?
Objectives
- Demonstrate how to convert case data to linelist data
- Demonstrate how to validate data
Prerequisite
This episode requires you to:
- Download the cleaned_data.csv file.
- Save it in the data/ folder.
Introduction
In outbreak analysis, once you have completed the initial steps of reading and cleaning the case data, it’s essential to establish an additional foundation layer to ensure the integrity and reliability of subsequent analyses. Specifically, this involves verifying the presence and correct data type of certain columns within your dataset, a process commonly referred to as “tagging.” Additionally, it’s crucial to implement measures to validate that these tagged columns are not inadvertently deleted during further data processing steps.
This episode focuses on tagging and validating outbreak data using the linelist package.
Let’s start by loading the package rio to read data and the package linelist to create a linelist. We’ll use the pipe %>% to connect some of their functions, including others from the package dplyr, so let’s also load the tidyverse package:
R
# Load packages
library(tidyverse) # for {dplyr} functions and the pipe %>%
library(rio) # for importing data
library(here) # for easy file referencing
library(linelist) # for tagging and validating
The double-colon
The double-colon :: in R lets you call a specific function from a package without loading the entire package into the current environment.
For example, dplyr::filter(data, condition) uses filter() from the dplyr package.
This helps us remember package functions and avoid namespace conflicts.
Import the dataset following the guidelines outlined in the Read case data episode. This involves loading the dataset into the working environment and viewing its structure and content.
R
# Read data
# e.g.: if path to file is data/cleaned_data.csv then:
cleaned_data <- rio::import(
here::here("data", "cleaned_data.csv")
) %>%
dplyr::as_tibble() # for a simple data frame output
OUTPUT
# A tibble: 15,000 × 10
v1 case_id age gender status date_onset date_sample row_id
<int> <int> <dbl> <chr> <chr> <IDate> <IDate> <int>
1 1 14905 90 male confirmed 2015-03-15 2015-06-04 1
2 2 13043 25 female <NA> 2013-09-11 2014-03-01 2
3 3 14364 54 female <NA> 2014-09-02 2015-03-03 3
4 4 14675 90 <NA> <NA> 2014-10-19 2031-12-14 4
5 5 12648 74 female <NA> 2014-08-06 2016-10-10 5
6 6 14274 76 female <NA> 2015-04-05 2016-01-23 7
7 7 14132 16 male confirmed NA 2015-05-10 8
8 8 14715 44 female confirmed NA 2016-04-24 9
9 9 13435 26 male <NA> 2014-09-07 2020-09-14 10
10 10 14816 30 female <NA> 2015-06-29 2015-06-02 11
# ℹ 14,990 more rows
# ℹ 2 more variables: years_since_collection <int>, remainder_months <int>
Discussion
An unexpected change
You are in an emergency response situation. You need to generate daily situation reports. You automated your analysis to read data directly from the online server 😁. However, the people in charge of the data collection/administration needed to remove/rename/reformat one variable you found helpful 😞!
How can you detect if the data input is still valid to replicate the analysis code you wrote the day before?
Creating a linelist and tagging elements
Then we convert the cleaned case data into a linelist object using the linelist package, as in the below code chunk.
R
# Create a linelist object from cleaned data
linelist_data <- linelist::make_linelist(
x = cleaned_data, # Input data
id = "case_id", # Column for unique case identifiers
date_onset = "date_onset", # Column for date of symptom onset
gender = "gender" # Column for gender
)
# Display the resulting linelist object
linelist_data
OUTPUT
// linelist object
# A tibble: 15,000 × 10
v1 case_id age gender status date_onset date_sample row_id
<int> <int> <dbl> <chr> <chr> <IDate> <IDate> <int>
1 1 14905 90 male confirmed 2015-03-15 2015-06-04 1
2 2 13043 25 female <NA> 2013-09-11 2014-03-01 2
3 3 14364 54 female <NA> 2014-09-02 2015-03-03 3
4 4 14675 90 <NA> <NA> 2014-10-19 2031-12-14 4
5 5 12648 74 female <NA> 2014-08-06 2016-10-10 5
6 6 14274 76 female <NA> 2015-04-05 2016-01-23 7
7 7 14132 16 male confirmed NA 2015-05-10 8
8 8 14715 44 female confirmed NA 2016-04-24 9
9 9 13435 26 male <NA> 2014-09-07 2020-09-14 10
10 10 14816 30 female <NA> 2015-06-29 2015-06-02 11
# ℹ 14,990 more rows
# ℹ 2 more variables: years_since_collection <int>, remainder_months <int>
// tags: id:case_id, date_onset:date_onset, gender:gender
The linelist package supplies tags for common epidemiological variables and a set of appropriate data types for each. You can view the list of available tags by variable name and their acceptable data types using linelist::tags_types().
Challenge
Let’s tag more variables. In new datasets, it will be frequent to have variable names different to the available tag names. However, we can associate them based on how variables were defined for data collection.
Now:
- Explore the available tag names in {linelist}.
- Find what other variables in the cleaned dataset can be associated with any of these available tags.
- Tag those variables as above using linelist::make_linelist().
You can access the list of available tag names in {linelist} using:
R
# Get a list of available tags by name and data types
linelist::tags_types()
# Get a list of names only
linelist::tags_names()
R
linelist::make_linelist(
x = cleaned_data,
id = "case_id",
date_onset = "date_onset",
gender = "gender",
age = "age", # same name in default list and dataset
date_reporting = "date_sample" # different names but related
)
How are these additional tags visible in the output?
Validation
To ensure that all tagged variables are standardized and have the correct data types, use linelist::validate_linelist(), as shown in the example below:
R
linelist::validate_linelist(linelist_data)
Challenge
Let’s validate tagged variables. Let’s simulate that, in an ongoing outbreak, the next day your data has a new set of entries (i.e., rows or observations), but one variable changes its data type.
For example, let’s make the variable age change type from a double (<dbl>) variable to character (<chr>).
To simulate it:
- Change the variable data type,
- Tag the variable into a linelist, and then
- Validate it.
Describe how linelist::validate_linelist() reacts when input data has a different variable data type.
We can use dplyr::mutate() to change the variable type before tagging for validation. For example:
R
cleaned_data %>%
# simulate a change of data type in one variable
dplyr::mutate(age = as.character(age)) %>%
# tag one variable
linelist::... %>%
# validate the linelist
linelist::...
Please run the code line by line, focusing only on the parts before the pipe (%>%). After each step, observe the output before moving to the next line.
If the age variable changes from double (<dbl>) to character (<chr>) we get the following:
R
cleaned_data %>%
# simulate a change of data type in one variable
dplyr::mutate(age = as.character(age)) %>%
# tag one variable
linelist::make_linelist(
age = "age"
) %>%
# validate the linelist
linelist::validate_linelist()
ERROR
Error: Some tags have the wrong class:
- age: Must inherit from class 'numeric'/'integer', but has class 'character'
Why are we getting an Error message?
Explore other situations to understand this behavior. Let’s try these additional changes to variables:
- date_onset changes from a <date> variable to character (<chr>),
- gender changes from a character (<chr>) variable to integer (<int>).
Then tag them into a linelist for validation. Does the Error message propose to us the solution?
R
# Change 2
# Run this code line by line to identify changes
cleaned_data %>%
# simulate a change of data type
dplyr::mutate(date_onset = as.character(date_onset)) %>%
# tag
linelist::make_linelist(
date_onset = "date_onset"
) %>%
# validate
linelist::validate_linelist()
R
# Change 3
# Run this code line by line to identify changes
cleaned_data %>%
# simulate a change of data type
dplyr::mutate(gender = as.factor(gender)) %>%
dplyr::mutate(gender = as.integer(gender)) %>%
# tag
linelist::make_linelist(
gender = "gender"
) %>%
# validate
linelist::validate_linelist()
We get Error messages because of the mismatch between the predefined tag type (from linelist::tags_types()) and the tagged variable class in the linelist.
The Error message informs us that, in order to validate our linelist, we must fix the input variable type to fit the expected tag type. In a data analysis script, we can do this by adding one cleaning step into the pipeline.
Challenge
What step along the linelist workflow of tagging and validating would respond to the absence of a variable?
About losing variables, you can simulate this scenario:
R
cleaned_data %>%
# simulate the removal of one variable
select(-age) %>%
# tag one variable
linelist::make_linelist(
age = "age"
)
ERROR
Error in base::tryCatch(base::withCallingHandlers({: 1 assertions failed:
* Variable 'tag': Must be element of set
* {'v1','case_id','gender','status','date_onset','date_sample','row_id','years_since_collection','remainder_months'},
* but is 'age'.
Safeguarding
Safeguarding is implicitly built into the linelist objects. If you try to drop any of the tagged columns, you will receive an error or warning message, as shown in the example below.
R
new_df <- linelist_data %>%
dplyr::select(case_id, gender)
WARNING
Warning: The following tags have lost their variable:
date_onset:date_onset
This Warning message above is the default output option when we lose tags in a linelist object. However, it can be changed to an Error message using linelist::lost_tags_action().
Challenge
Let’s test the implications of changing the safeguarding configuration from a Warning to an Error message.
- First, run this code to count the frequency per category within a categorical variable:
R
linelist_data %>%
dplyr::select(case_id, gender) %>%
dplyr::count(gender)
- Set the behavior for lost tags in a linelist to “error” as follows:
R
# set behavior to "error"
linelist::lost_tags_action(action = "error")
- Now, re-run the above code segment with dplyr::count().
Identify:
- What is the difference in the output between a Warning and an Error?
- What could be the implications of this change for your daily data analysis pipeline during an outbreak response?
Deciding between a Warning or an Error message will depend on the level of attention or flexibility you need when losing tags. One will alert you about a change but will continue running the code downstream. The other will stop your analysis pipeline and the rest will not be executed.
A data reading, cleaning and validation script may require a more stable or fixed pipeline. An exploratory data analysis may require a more flexible approach. These two processes can be isolated in different scripts or repositories to adjust the safeguarding according to your needs.
Before you continue, set the configuration back to the default option of Warning:
R
# set behavior to the default option: "warning"
linelist::lost_tags_action()
OUTPUT
Lost tags will now issue a warning.
A linelist object resembles a data frame but offers richer features and functionalities. Packages that are linelist-aware can leverage these features. For example, you can extract a data frame of only the tagged columns using the linelist::tags_df() function, as shown below:
R
linelist::tags_df(linelist_data)
OUTPUT
# A tibble: 15,000 × 3
id date_onset gender
<int> <IDate> <chr>
1 14905 2015-03-15 male
2 13043 2013-09-11 female
3 14364 2014-09-02 female
4 14675 2014-10-19 <NA>
5 12648 2014-08-06 female
6 14274 2015-04-05 female
7 14132 NA male
8 14715 NA female
9 13435 2014-09-07 male
10 14816 2015-06-29 female
# ℹ 14,990 more rows
This allows the extraction and use of tagged-only columns in downstream analysis, which will be useful for the next episode!
When should I use {linelist}?
Data analysis during an outbreak response or mass-gathering surveillance demands a different set of “data safeguards” compared to usual research situations. For example, your data will change or be updated over time (e.g. new entries, new variables, renamed variables).
linelist is more appropriate for this type of ongoing or long-lasting analysis. Check the “Get started” vignette section about When you should consider using {linelist}? for more information.
Key Points
- Use the linelist package to tag, validate, and prepare case data for downstream analysis.
Content from Aggregate and visualize
Overview
Questions
- How to aggregate case data?
- How to visualize aggregated data?
- What is the distribution of cases in time, place, gender, and age?
Objectives
- Simulate synthetic outbreak data
- Convert linelist data to incidence
- Create epidemic curves from incidence data
Introduction
In an analytic pipeline, exploratory data analysis (EDA) is an important step before formal modelling. EDA helps determine relationships between variables and summarize their main characteristics often by means of data visualization.
This episode focuses on EDA of outbreak data using a few essential R packages. A key aspect of EDA in epidemic analysis is identifying the relationship between time and the observed epidemic outcome, such as confirmed cases, hospitalizations, deaths, and recoveries across different locations and demographic factors, including gender, age, and more.
Let’s start by loading the package incidence2 to aggregate linelist data by groups and visualize epicurves. We’ll use {simulist} to simulate outbreak data, and {tracetheme} for complementary figure formatting. We’ll use the pipe %>% to connect some of their functions, including others from the packages dplyr and ggplot2, so let’s also load the tidyverse package:
R
# Load packages
library(incidence2) # For aggregating and visualising
library(simulist) # For simulating linelist data
library(tracetheme) # For formatting figures
library(tidyverse) # For {dplyr} and {ggplot2} functions and the pipe %>%
The double-colon
The double-colon :: in R lets you call a specific function from a package without loading the entire package into the current environment.
For example, dplyr::filter(data, condition) uses filter() from the dplyr package.
This helps us remember package functions and avoid namespace conflicts.
Synthetic outbreak data
To illustrate the process of conducting EDA on outbreak data, we will generate a line list for a hypothetical disease outbreak utilizing the {simulist} package. {simulist} generates simulated outbreak data according to a given configuration. Its minimal configuration can generate a linelist, as shown in the below code chunk:
R
# Simulate linelist data for an outbreak with size between 1000 and 1500
set.seed(1) # Set seed for reproducibility
sim_data <- simulist::sim_linelist(outbreak_size = c(1000, 1500)) %>%
dplyr::as_tibble() # for a simple data frame output
WARNING
Warning: Number of cases exceeds maximum outbreak size.
Returning data early with 1546 cases and 3059 total contacts (including cases).
R
# Display the simulated dataset
sim_data
OUTPUT
# A tibble: 1,546 × 12
id case_name case_type sex age date_onset date_admission outcome
<int> <chr> <chr> <chr> <int> <date> <date> <chr>
1 1 Kaylin Alberts probable f 70 2023-01-01 2023-01-06 recove…
2 3 Guirnalda Azuc… probable f 25 2023-01-11 2023-01-18 died
3 6 Kevin Lee suspected m 80 2023-01-18 NA recove…
4 8 Ashraf al-Raha… probable m 8 2023-01-23 2023-02-01 recove…
5 11 Jacob Miller probable m 69 2023-01-30 NA recove…
6 14 Rocky Bustillos suspected m 40 2023-01-24 2023-01-29 recove…
7 15 Jim Soriano confirmed m 37 2023-01-31 NA recove…
8 16 Abdul Wadood e… suspected m 67 2023-01-30 NA recove…
9 20 Kristy Neish probable f 57 2023-01-27 NA recove…
10 21 Azeema al-Shab… confirmed f 70 2023-02-09 2023-02-13 died
# ℹ 1,536 more rows
# ℹ 4 more variables: date_outcome <date>, date_first_contact <date>,
# date_last_contact <date>, ct_value <dbl>
This linelist dataset offers individual-level information about the outbreak.
This is the default configuration of {simulist}; if you want to know more about its functionalities, check the documentation website.
You can also find datasets from real past emergencies in the {outbreaks} R package.
Aggregating
Downstream analysis involves working with aggregated data rather than individual cases. This requires grouping linelist data in the form of incidence data. The incidence2 package offers an essential function, called incidence2::incidence(), for grouping case data, usually centered around dated events and/or other factors. The code chunk provided below demonstrates the creation of an <incidence2> class object from the simulated linelist data based on the date of onset.
R
# Create an incidence object by aggregating case data based on the date of onset
dialy_incidence <- incidence2::incidence(
sim_data,
date_index = "date_onset",
interval = "day" # Aggregate by daily intervals
)
# View the incidence data
dialy_incidence
OUTPUT
# incidence: 232 x 3
# count vars: date_onset
date_index count_variable count
<date> <chr> <int>
1 2023-01-01 date_onset 1
2 2023-01-11 date_onset 1
3 2023-01-18 date_onset 1
4 2023-01-23 date_onset 1
5 2023-01-24 date_onset 1
6 2023-01-27 date_onset 2
7 2023-01-29 date_onset 1
8 2023-01-30 date_onset 2
9 2023-01-31 date_onset 2
10 2023-02-01 date_onset 1
# ℹ 222 more rows
Furthermore, with the incidence2 package, you can specify the desired interval and categorize cases by one or more factors. Below is a code snippet demonstrating weekly cases grouped by the date of onset and gender.
R
# Group incidence data by week, accounting for sex and case type
weekly_incidence <- incidence2::incidence(
sim_data,
date_index = "date_onset",
interval = "week", # Aggregate by weekly intervals
groups = c("sex", "case_type") # Group by sex and case type
)
# View the incidence data
weekly_incidence
OUTPUT
# incidence: 202 x 5
# count vars: date_onset
# groups: sex, case_type
date_index sex case_type count_variable count
<isowk> <chr> <chr> <chr> <int>
1 2022-W52 f probable date_onset 1
2 2023-W02 f probable date_onset 1
3 2023-W03 m suspected date_onset 1
4 2023-W04 f probable date_onset 1
5 2023-W04 m confirmed date_onset 2
6 2023-W04 m probable date_onset 1
7 2023-W04 m suspected date_onset 1
8 2023-W05 f confirmed date_onset 4
9 2023-W05 f probable date_onset 2
10 2023-W05 f suspected date_onset 2
# ℹ 192 more rows
Dates Completion
When cases are grouped by different factors, it’s possible that these groups may have different date ranges in the resulting incidence2 object. The incidence2 package provides a function called complete_dates() to ensure that an incidence object has the same range of dates for each group. By default, missing counts will be filled with 0.
This functionality is also available as an argument within incidence2::incidence() by adding complete_dates = TRUE.
R
# Create an incidence object grouped by sex, aggregating daily
dialy_incidence_2 <- incidence2::incidence(
sim_data,
date_index = "date_onset",
groups = "sex",
interval = "day", # Aggregate by daily intervals
complete_dates = TRUE # Complete missing dates in the incidence object
)
Challenge 1: Can you do it?
- Task: Aggregate the sim_data linelist based on admission date and case outcome in biweekly intervals, and save the results in an object called biweekly_incidence.
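If you get stuck, here is one possible sketch, assuming “biweekly” means 14-day periods:
R
# Aggregate by admission date and case outcome in 14-day periods
biweekly_incidence <- incidence2::incidence(
  sim_data,
  date_index = "date_admission",
  groups = "outcome",
  interval = 14 # 14-day (biweekly) intervals
)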
Visualization
The incidence2 object can be visualized using the plot() function from the base R package. The resulting graph is referred to as an epidemic curve, or epi-curve for short. The following code snippets generate epi-curves for the dialy_incidence and weekly_incidence incidence objects mentioned above.
R
# Plot daily incidence data
base::plot(dialy_incidence) +
ggplot2::labs(
x = "Time (in days)", # x-axis label
y = "Dialy cases" # y-axis label
) +
tracetheme::theme_trace() # Apply the custom trace theme
R
# Plot weekly incidence data
base::plot(weekly_incidence) +
ggplot2::labs(
x = "Time (in weeks)", # x-axis label
y = "weekly cases" # y-axis label
) +
tracetheme::theme_trace() # Apply the custom trace theme
easy aesthetics
We invite you to skim the incidence2 package “Get started” vignette. Find how you can use arguments within plot() to provide aesthetics to your incidence2 class objects!
R
base::plot(weekly_incidence, fill = "sex")
Some of them include show_cases = TRUE, angle = 45, and n_breaks = 5. Give them a try!
Challenge 2: Can you do it?
- Task: Visualize the biweekly_incidence object.
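A minimal sketch, assuming the outcome grouping from Challenge 1:
R
# Plot the biweekly epicurve, filling bars by outcome
base::plot(biweekly_incidence, fill = "outcome") +
  tracetheme::theme_trace()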
Curve of cumulative cases
The cumulative number of cases can be calculated using the cumulate() function from an incidence2 object and visualized, as in the example below.
R
# Calculate cumulative incidence
cum_df <- incidence2::cumulate(dialy_incidence)
# Plot cumulative incidence data using ggplot2
base::plot(cum_df) +
ggplot2::labs(
x = "Time (in days)", # x-axis label
y = "weekly cases" # y-axis label
) +
tracetheme::theme_trace() # Apply the custom trace theme
Note that this function preserves grouping, i.e., if the incidence2 object contains groups, it will accumulate the cases accordingly.
Challenge 3: Can you do it?
- Task: Visualize the cumulative cases from the biweekly_incidence object.
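One possible sketch, mirroring the example above:
R
# Accumulate the biweekly counts, then plot the cumulative curve
cum_biweekly <- incidence2::cumulate(biweekly_incidence)
base::plot(cum_biweekly) +
  tracetheme::theme_trace()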
Peak estimation
One can estimate the peak (the time with the highest number of recorded cases) using the estimate_peak() function from the {incidence2} package. This function employs a bootstrapping method to determine the peak time.
R
# Estimate the peak of the daily incidence data
peak <- incidence2::estimate_peak(
dialy_incidence,
n = 100, # Number of simulations for the peak estimation
alpha = 0.05, # Significance level for the confidence interval
first_only = TRUE, # Return only the first peak found
progress = FALSE # Disable progress messages
)
# Display the estimated peak
print(peak)
OUTPUT
# A tibble: 1 × 7
count_variable observed_peak observed_count bootstrap_peaks lower_ci
<chr> <date> <int> <list> <date>
1 date_onset 2023-05-01 22 <df [100 × 1]> 2023-03-26
# ℹ 2 more variables: median <date>, upper_ci <date>
This example demonstrates how to estimate the peak time using the estimate_peak() function with a 95% confidence interval and 100 bootstrap samples.
Challenge 4: Can you do it?
- Task: Estimate the peak time from the biweekly_incidence object.
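A sketch reusing the arguments from the example above:
R
# Estimate the peak of the biweekly incidence data
incidence2::estimate_peak(
  biweekly_incidence,
  n = 100, # number of bootstrap simulations
  alpha = 0.05, # significance level for the confidence interval
  first_only = TRUE, # return only the first peak found
  progress = FALSE # disable progress messages
)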
Visualization with ggplot2
incidence2 produces basic plots for epicurves, but additional work is required to create well-annotated graphs. However, using the ggplot2 package, you can generate more sophisticated and better-annotated epicurves. ggplot2 is a comprehensive package with many functionalities. However, we will focus on three key elements for producing epicurves: histogram plots, scaling date axes and their labels, and general plot theme annotation. The example below demonstrates how to configure these three elements for a simple incidence2 object.
R
# Define date breaks for the x-axis
breaks <- seq.Date(
from = min(as.Date(dialy_incidence$date_index), na.rm = TRUE),
to = max(as.Date(dialy_incidence$date_index), na.rm = TRUE),
by = 20 # every 20 days
)
# Create the plot
ggplot2::ggplot(data = dialy_incidence) +
geom_histogram(
mapping = aes(
x = as.Date(date_index),
y = count
),
stat = "identity",
color = "blue", # bar border color
fill = "lightblue", # bar fill color
width = 1 # bar width
) +
theme_minimal() + # apply a minimal theme for clean visuals
theme(
plot.title = element_text(face = "bold",
hjust = 0.5), # center and bold title
plot.subtitle = element_text(hjust = 0.5), # center subtitle
plot.caption = element_text(face = "italic",
hjust = 0), # italicized caption
axis.title = element_text(face = "bold"), # bold axis titles
axis.text.x = element_text(angle = 45, vjust = 0.5) # rotated x-axis text
) +
labs(
x = "Date", # x-axis label
y = "Number of cases", # y-axis label
title = "Daily Outbreak Cases", # plot title
subtitle = "Epidemiological Data for the Outbreak", # plot subtitle
caption = "Data Source: Simulated Data" # plot caption
) +
scale_x_date(
breaks = breaks, # set custom breaks on the x-axis
labels = scales::label_date_short() # shortened date labels
)
WARNING
Warning in geom_histogram(mapping = aes(x = as.Date(date_index), y = count), :
Ignoring unknown parameters: `binwidth`, `bins`, and `pad`
Use the group option in the mapping function to visualize an epicurve with different groups. If there is more than one grouping factor, use the facet_wrap() option, as demonstrated in the example below:
R
# Plot daily incidence by sex with facets
ggplot2::ggplot(data = dialy_incidence_2) +
geom_histogram(
mapping = aes(
x = as.Date(date_index),
y = count,
group = sex,
fill = sex
),
stat = "identity"
) +
theme_minimal() + # apply minimal theme
theme(
plot.title = element_text(face = "bold",
hjust = 0.5), # bold and center the title
plot.subtitle = element_text(hjust = 0.5), # center the subtitle
plot.caption = element_text(face = "italic", hjust = 0), # italic caption
axis.title = element_text(face = "bold"), # bold axis labels
axis.text.x = element_text(angle = 45,
vjust = 0.5) # rotate x-axis text for readability
) +
labs(
x = "Date", # x-axis label
y = "Number of cases", # y-axis label
title = "Daily Outbreak Cases by Sex", # plot title
subtitle = "Incidence of Cases Grouped by Sex", # plot subtitle
caption = "Data Source: Simulated Data" # caption for additional context
) +
facet_wrap(~sex) + # create separate panels by sex
scale_x_date(
breaks = breaks, # set custom date breaks
labels = scales::label_date_short() # short date format for x-axis labels
) +
scale_fill_manual(values = c("lightblue",
"lightpink")) # custom fill colors for sex
WARNING
Warning in geom_histogram(mapping = aes(x = as.Date(date_index), y = count, :
Ignoring unknown parameters: `binwidth`, `bins`, and `pad`
Challenge 5: Can you do it?
- Task: Produce an annotated figure for biweekly_incidence using the ggplot2 package.
Key Points
- Use the {simulist} package to generate synthetic outbreak data.
- Use the incidence2 package to aggregate case data based on a date event, and produce epidemic curves.
- Use the ggplot2 package to produce better-annotated epicurves.