Read and clean case data, and make linelist for outbreak analytics with R: All in One View

Content from Read case data

Last updated on 2025-11-11

Estimated time: 30 minutes

Overview

Questions

Where do you usually store your outbreak data?
How many different data formats can you use for analysis?
Can you import data from servers and health information systems?

Objectives

Explain how to import outbreak data from different sources into the R environment.

Prerequisites

This episode requires you to be familiar with: Data science: Basic tasks with R.

Introduction

The initial step in outbreak analysis typically involves importing the target dataset into the R environment from either a local source (like a file on your computer) or an external source (like a database). Outbreak data can be stored in files of diverse formats, in relational database management systems (RDBMS), or in health information systems (HIS), such as REDCap and DHIS2, which provide application programming interfaces (APIs) to the database systems so verified users can easily add and access data entries. The latter option is particularly well-suited for collecting and storing large-scale institutional health data. This episode explains how to read case data from these sources.

Let’s start by loading the package rio to read data and the package here to easily find a file path within your RStudio project. We’ll use the pipe operator (%>%) from the magrittr package to easily connect some of their functions, including functions from the data formatting package dplyr. We’ll therefore call the tidyverse package, which includes both magrittr and dplyr:

R

# Load packages
library(tidyverse) # for {dplyr} functions and the pipe %>%
library(rio) # for importing data from files
library(here) # for easy file referencing
library(readepi) # for importing data directly from RDBMS or HIS
library(dbplyr) # for a database backend for {dplyr}

The double-colon

The double-colon :: in R lets you call a specific function from a package without loading the entire package into the current environment.

For example, dplyr::filter(data, condition) uses filter() from the dplyr package, without having to use library(dplyr) at the start of a script.

This helps us remember which package each function comes from and avoids namespace conflicts (i.e. when two different packages include functions with the same name, so R does not know which to use).
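As a minimal sketch of such a conflict, the snippet below defines a local function that shares its name with stats::median() and then compares the unqualified call with the qualified one. The local function and the values vector are invented for illustration; only base R is used.

```r
# A local function that happens to share its name with stats::median()
median <- function(x) "my own median"

values <- c(1, 5, 9)

# The unqualified call finds the local definition first
local_result <- median(values)

# The qualified call always reaches the function from the {stats} package
package_result <- stats::median(values)

local_result
package_result

# Remove the local function so later code sees the usual median() again
rm(median)
```

The same idea applies to dplyr::filter() and stats::filter(): writing the package name in front of the function removes any doubt about which one runs.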

Setup a project and folder

Create an RStudio project. If needed, follow this how-to guide on “Hello RStudio Projects” to create one.
Inside the RStudio project, create a data/ folder.
Inside the data/ folder, save the ebola_cases_2.csv and Marburg.zip files.

Reading from files

Several packages are available for importing outbreak data stored in individual files into R. These include {rio}, {readr} from the tidyverse, {io}, {ImportExport}, and {data.table}. Together, these packages offer methods to read single or multiple files in a wide range of formats.

The example below shows how to import a CSV file into the R environment using the rio package. We use the here package to tell R to look for the file in the data/ folder of your project, and dplyr::as_tibble() to convert the result into a tidier format for subsequent analysis in R.

R

# read data
# e.g., if the path to our file is data/ebola_cases_2.csv then:
ebola_confirmed <- rio::import(
  here::here("data", "ebola_cases_2.csv")
) %>%
  dplyr::as_tibble() # for a simple data frame output

# preview data
ebola_confirmed

OUTPUT

# A tibble: 120 × 4
    year month   day confirm
   <int> <int> <int>   <int>
 1  2014     5    18       1
 2  2014     5    20       2
 3  2014     5    21       4
 4  2014     5    22       6
 5  2014     5    23       1
 6  2014     5    24       2
 7  2014     5    26      10
 8  2014     5    27       8
 9  2014     5    28       2
10  2014     5    29      12
# ℹ 110 more rows

Similarly, you can import files of other formats, such as tsv and xlsx.
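To make this concrete, here is a small sketch that writes a tab-separated file to a temporary location and reads it back. The toy values are the first two rows of the preview above; the temporary file and object names are assumptions made for illustration. The rio call is only run if the package is installed.

```r
# Two rows copied from the ebola_confirmed preview
toy_cases <- data.frame(
  year = c(2014, 2014),
  month = c(5, 5),
  day = c(18, 20),
  confirm = c(1, 2)
)

# Write a tab-separated file to a temporary location
tsv_path <- tempfile(fileext = ".tsv")
utils::write.table(
  toy_cases,
  file = tsv_path,
  sep = "\t",
  row.names = FALSE,
  quote = FALSE
)

# Base R reader for tab-separated files
from_base <- utils::read.delim(tsv_path)
from_base

# With {rio} installed, the call used for CSV files also works here,
# because rio::import() infers the format from the file extension
if (requireNamespace("rio", quietly = TRUE)) {
  from_rio <- rio::import(tsv_path)
}
```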

Why should we use the {here} package?

The here package is designed to simplify file referencing in R projects by providing a reliable way to construct file paths relative to the project root. The main reason to use it is cross-environment compatibility.

It works across different operating systems (Windows, Mac, Linux) without needing to adjust file paths.

On Windows, paths are written using backslashes ( \ ) as the separator between folder names: "data\raw-data\file.csv"
On Unix based operating systems such as macOS or Linux the forward slash ( / ) is used as the path separator: "data/raw-data/file.csv"

The here package reinforces the reproducibility of your work across multiple operating systems. If you are interested in reproducibility, we invite you to read this tutorial to increase the openness, sustainability, and reproducibility of your epidemic analysis with R.
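The idea of building a path from its parts, rather than typing separators by hand, can be sketched with base R's file.path(). The folder and file names are the ones used in the examples above; the here call is only run if the package is installed.

```r
# file.path() joins folder names with "/" on every platform,
# so the same line works on Windows, macOS, and Linux
relative_path <- file.path("data", "raw-data", "file.csv")
relative_path

# With {here} installed, the same pieces are resolved from the project root
if (requireNamespace("here", quietly = TRUE)) {
  here::here("data", "raw-data", "file.csv")
}
```

The difference is that file.path() returns a path relative to the current working directory, while here::here() anchors it at the project root, so it does not depend on where the script is run from.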

Reading compressed data

Can you read data from a compressed file in R?

Download this zip file containing data for a Marburg outbreak and then import it to your working environment.

Give me a hint

You can check the full list of supported file formats in the rio package on the package website. To expand {rio} to the full range of supported formats run:

R

rio::install_formats()

Show me the solution

R

rio::import(here::here("data", "Marburg.zip"))

Reading from databases

The readepi library contains functions that allow you to import data directly from RDBMS or HIS (through their APIs). The readepi::read_rdbms() function allows you to import data from servers such as Microsoft SQL Server, MySQL, PostgreSQL, and SQLite. It is primarily based on the {DBI} library, which serves as a general-purpose interface for interacting with relational database management systems (RDBMS).

When to read directly from a database?

Importing data directly from a database optimizes the memory usage in the R session. If we process the database with “queries” (e.g., select, filter, summarise) before extraction, we can reduce the memory load in our RStudio session. Conversely, conducting all data manipulation outside the database management system by loading the full dataset into R can use up much more computer memory (i.e. RAM) than is feasible on a local machine, which can lead RStudio to slow down or even freeze.

Relational database management systems (RDBMS) also have the advantage that multiple users can access, store and analyse parts of the dataset simultaneously, without having to transfer individual files, which would make it very difficult to track which version is up-to-date.
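The memory argument can be illustrated without a remote server. The sketch below uses an in-memory SQLite database as a stand-in, which assumes the DBI, RSQLite, dplyr, and dbplyr packages are installed. The cases table and its values are invented for illustration. The steps that follow in this episode apply the same pattern to a real MySQL server through readepi.

```r
library(dplyr)

# In-memory SQLite database standing in for a remote server
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")

# Hypothetical case table, invented for illustration
toy_cases <- data.frame(
  case_id = 1:6,
  region = c("north", "south", "north", "east", "south", "north"),
  confirmed = c(1, 0, 1, 1, 0, 1)
)
DBI::dbWriteTable(con, "cases", toy_cases)

# Build a lazy query: nothing is pulled into R yet
north_confirmed <- dplyr::tbl(con, "cases") %>%
  dplyr::filter(region == "north", confirmed == 1)

# Only the rows that pass the filter travel into the R session
local_copy <- dplyr::collect(north_confirmed)
nrow(local_copy)

DBI::dbDisconnect(con)
```

Here the database holds six rows, but only three reach the R session. On a table with millions of records, the same pattern keeps the local memory footprint small.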

1. Connect with a database

You can use the readepi::login() function to establish a connection to the database as shown below.

R

# establish the connection to a test MySQL database
rdbms_login <- readepi::login(
  from = "mysql-rfam-public.ebi.ac.uk",
  type = "MySQL",
  user_name = "rfamro",
  password = "",
  driver_name = "",
  db_name = "Rfam",
  port = 4497
)

OUTPUT

✔ Logged in successfully!

R

rdbms_login

OUTPUT

<Pool> of MySQLConnection objects
  Objects checked out: 0
  Available in pool: 1
  Max size: Inf
  Valid: TRUE

Callout

For this example, access may be limited by organizational network restrictions, but it should work normally on home networks.

2. Access the list of tables from the database

The readepi::show_tables() function can be used to access the full list of table names from a database.

R

# get the table names
tables <- readepi::show_tables(login = rdbms_login)

tables

In a database framework, you can have more than one table. Each table can belong to a specific entity (e.g., patients, care units, jobs). All tables will be related by a common ID or primary key.
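As a rough illustration of tables related by a common ID, the sketch below links two small data frames with base R's merge(). Both tables, their column names, and their values are hypothetical.

```r
# Hypothetical tables, invented for illustration
patients <- data.frame(
  patient_id = c(1, 2, 3),
  age = c(34, 58, 7)
)
admissions <- data.frame(
  patient_id = c(1, 1, 3),
  care_unit = c("triage", "isolation", "paediatrics")
)

# Left join on the shared key: every patient is kept, matched or not
linked <- merge(patients, admissions, by = "patient_id", all.x = TRUE)
linked
```

Patient 1 appears twice because it has two admissions, and patient 2 is kept with a missing care_unit because it has none.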

3. Read data from a table in a database

Use the readepi::read_rdbms() function to import data from a table in a database. It can take an SQL query or a list of query parameters as demonstrated in the code chunk below.

R

# import data from the 'author' table using an SQL query
dat <- readepi::read_rdbms(
  login = rdbms_login,
  query = "select * from author"
)

# import data from the 'author' table using a list of parameters
dat <- readepi::read_rdbms(
  login = rdbms_login,
  query = list(table = "author", fields = NULL, filter = NULL)
)

Alternatively, we can read the data from the author table using dplyr::tbl().

R

# reference the 'author' table with dplyr::tbl(), then filter and sort it
dat <- rdbms_login %>%
  dplyr::tbl(from = "author") %>%
  dplyr::filter(initials == "A") %>%
  dplyr::arrange(desc(author_id))

dat

OUTPUT

# Source:     SQL [?? x 6]
# Database:   mysql 8.0.32-24 [@mysql-rfam-public.ebi.ac.uk:/Rfam]
# Ordered by: desc(author_id)
  author_id name           last_name    initials orcid                 synonyms
      <int> <chr>          <chr>        <chr>    <chr>                 <chr>
1        46 Roth A         Roth         A        ""                    ""
2        42 Nahvi A        Nahvi        A        ""                    ""
3        32 Machado Lima A Machado Lima A        ""                    ""
4        31 Levy A         Levy         A        ""                    ""
5        27 Gruber A       Gruber       A        "0000-0003-1219-4239" ""
6        13 Chen A         Chen         A        ""                    ""
7         6 Bateman A      Bateman      A        "0000-0002-6982-4660" ""

If we apply dplyr verbs to this database table, these verbs will be translated to SQL queries.

R

# Show the SQL queries translated
dat %>%
  dplyr::show_query()

OUTPUT

<SQL>
SELECT `author`.*
FROM `author`
WHERE (`initials` = 'A')
ORDER BY `author_id` DESC

4. Extract data from the database

Use dplyr::collect() to force computation of a database query and extract the output to your local computer.

R

# Pull all data down to a local tibble
dat %>%
  dplyr::collect()

OUTPUT

# A tibble: 7 × 6
  author_id name           last_name    initials orcid                 synonyms
      <int> <chr>          <chr>        <chr>    <chr>                 <chr>
1        46 Roth A         Roth         A        ""                    ""
2        42 Nahvi A        Nahvi        A        ""                    ""
3        32 Machado Lima A Machado Lima A        ""                    ""
4        31 Levy A         Levy         A        ""                    ""
5        27 Gruber A       Gruber       A        "0000-0003-1219-4239" ""
6        13 Chen A         Chen         A        ""                    ""
7         6 Bateman A      Bateman      A        "0000-0002-6982-4660" ""

Ideally, by specifying a set of queries before extraction, we reduce the size of the dataset that is loaded into our R session.

Run SQL queries in R using dbplyr

Practice how to make relational database SQL queries using multiple dplyr verbs like dplyr::left_join() among tables before pulling down data to your local session with dplyr::collect()!

You can also review the dbplyr R package. For a step-by-step tutorial about SQL, we recommend this tutorial about data management with SQL for Ecologists. You will find it close to dplyr!

Give me a hint

R

# SELECT FEW COLUMNS FROM ONE TABLE AND LEFT JOIN WITH ANOTHER TABLE
author <- rdbms_login %>%
  dplyr::tbl(from = "author") %>%
  dplyr::select(author_id, name)

family_author <- rdbms_login %>%
  dplyr::tbl(from = "family_author") %>%
  dplyr::select(author_id, rfam_acc)

dplyr::left_join(author, family_author, keep = TRUE) %>%
  dplyr::show_query()

OUTPUT

Joining with `by = join_by(author_id)`

OUTPUT

<SQL>
SELECT
  `author`.`author_id` AS `author_id.x`,
  `name`,
  `family_author`.`author_id` AS `author_id.y`,
  `rfam_acc`
FROM `author`
LEFT JOIN `family_author`
  ON (`author`.`author_id` = `family_author`.`author_id`)

R

dplyr::left_join(author, family_author, keep = TRUE) %>%
  dplyr::collect()

OUTPUT

Joining with `by = join_by(author_id)`

OUTPUT

# A tibble: 4,874 × 4
   author_id.x name         author_id.y rfam_acc
         <int> <chr>              <int> <chr>
 1           1 Ames T                 1 RF01831
 2           2 Argasinska J           2 RF02554
 3           2 Argasinska J           2 RF02555
 4           2 Argasinska J           2 RF02722
 5           2 Argasinska J           2 RF02720
 6           2 Argasinska J           2 RF02719
 7           2 Argasinska J           2 RF02721
 8           2 Argasinska J           2 RF02670
 9           2 Argasinska J           2 RF02718
10           2 Argasinska J           2 RF02668
# ℹ 4,864 more rows

Reading from HIS APIs

Health data is increasingly stored in specialized HIS such as Fingertips, GoData, REDCap, DHIS2, SORMAS, etc. The current version of the readepi library allows importing data from DHIS2 and SORMAS.

Importing data from DHIS2

The District Health Information System (DHIS2) is open-source software that has revolutionized global health information management. The readepi::read_dhis2() function allows you to import data from the DHIS2 Tracker system via its API.

To successfully import the data from DHIS2, you will need to connect to the system using the readepi::login() function, then provide the name or ID of the target program and organisation unit.

For a given system, you can access the IDs and names of the programs and organisation units using the get_programs() and get_organisation_units() functions, respectively.

R

# establish the connection to the system
dhis2_login <- readepi::login(
  from = "https://smc.moh.gm/dhis",
  user_name = "test",
  password = "Gambia@123"
)

OUTPUT

✔ Logged in successfully!

R

# get the names and IDs of the programs
programs <- readepi::get_programs(login = dhis2_login)

# get the names and IDs of the organisation units
org_units <- readepi::get_organisation_units(login = dhis2_login)

R

# import data from DHIS2 using IDs
data <- readepi::read_dhis2(
  login = dhis2_login,
  org_unit = "GcLhRNAFppR",
  program = "E5IUQuHg3Mg"
)

# import data from DHIS2 using names
data <- readepi::read_dhis2(
  login = dhis2_login,
  org_unit = "Keneba",
  program = "Child Registration & Treatment "
)

tibble::as_tibble(data)

OUTPUT

# A tibble: 1,116 × 69
   event   tracked_entity org_unit ` SMC-CR Scan QR Code` SMC-CR Did the child…¹
   <chr>   <chr>          <chr>    <chr>                  <chr>
 1 bgSDQb… yv7MOkGD23q    Keneba   SMC23-0510989          1
 2 y4MKmP… nibnZ8h0Nse    Keneba   SMC2021-018089         1
 3 yK7VG3… nibnZ8h0Nse    Keneba   SMC2021-018089         1
 4 EmNflz… nibnZ8h0Nse    Keneba   SMC2021-018089         1
 5 UF96ms… nibnZ8h0Nse    Keneba   SMC2021-018089         1
 6 guQTwc… FomREQ2it4n    Keneba   SMC23-0510012          1
 7 jbkRkL… FomREQ2it4n    Keneba   SMC23-0510012          1
 8 AEeype… FomREQ2it4n    Keneba   SMC23-0510012          1
 9 R30SPs… E5oAWGcdFT4    Keneba   koika-smc-22897        1
10 nr03Qy… E5oAWGcdFT4    Keneba   koika-smc-22897        1
# ℹ 1,106 more rows
# ℹ abbreviated name: ¹`SMC-CR Did the child  previously received a card?`
# ℹ 64 more variables: `SMC-CR Child First Name1` <chr>,
#   `SMC-CR Child Last Name` <chr>, `SMC-CR Date of Birth` <chr>,
#   `SMC-CR Select Age Category  ` <chr>, `SMC-CR Child gender1` <chr>,
#   `SMC-CR Mother/Person responsible full name` <chr>,
#   `SMC-CR Mother/Person responsible phone number1` <chr>, …

It is important to know that not all organisation units are registered for a specific program. To find out which organisation units are running a particular program, use the get_program_org_units() function as shown in the example below.

R

# get the list of organisation units that run the program "E5IUQuHg3Mg"
target_org_units <- readepi::get_program_org_units(
  login = dhis2_login,
  program = "E5IUQuHg3Mg",
  org_units = org_units
)

tibble::as_tibble(target_org_units)

OUTPUT

# A tibble: 26 × 3
   org_unit_ids levels            org_unit_names
   <chr>        <chr>             <chr>
 1 UrLrbEiWk3J  Town/Village_name Sare Sibo
 2 wlVsFVeHSTx  Town/Village_name Jawo Kunda
 3 kp0ZYUEqJE8  Town/Village_name Chewal
 4 Wr3htgGxhBv  Town/Village_name Madinayel
 5 psyHoqeN2Tw  Town/Village_name Bolibanna
 6 MGBYonFM4y3  Town/Village_name Sare Mala
 7 GcLhRNAFppR  Town/Village_name Keneba
 8 y1Z3KuvQyhI  Town/Village_name Brikama
 9 W3vH9yBUSei  Town/Village_name Gidda
10 ISbNWYieHY8  Town/Village_name Song Kunda
# ℹ 16 more rows
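Once this table is available, it can be used to look up the ID that corresponds to a given organisation unit name. The sketch below rebuilds three rows of the output above as a plain data frame, so that it runs without a connection; the object name target_org_units_sample is an assumption made for illustration.

```r
# Three rows copied from the target_org_units output above
target_org_units_sample <- data.frame(
  org_unit_ids = c("UrLrbEiWk3J", "GcLhRNAFppR", "y1Z3KuvQyhI"),
  levels = "Town/Village_name",
  org_unit_names = c("Sare Sibo", "Keneba", "Brikama")
)

# Look up the ID of the organisation unit named "Keneba"
keneba_id <- with(
  target_org_units_sample,
  org_unit_ids[org_unit_names == "Keneba"]
)
keneba_id
```

The result is the same ID that was passed to the org_unit argument of readepi::read_dhis2() earlier in this section.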

Importing data from SORMAS

The Surveillance Outbreak Response Management and Analysis System (SORMAS) is an open-source e-health system that optimizes infectious disease surveillance and outbreak response processes. The readepi::read_sormas() function allows you to import data from SORMAS via its API.

In the current version of the readepi package, the read_sormas() function returns data for the following columns: case_id, person_id, sex, date_of_birth, case_origin, country, city, lat, long, case_status, date_onset, date_admission, date_last_contact, date_first_contact, outcome, date_outcome, Ct_values.

One of the fundamental arguments is the name of the disease for which the user wants to get data. To ensure the correct syntax to use when calling the function, you can get the list of disease names through the sormas_get_diseases() function.

R

# get the list of all disease names
disease_names <- readepi::sormas_get_diseases(
  base_url = "https://demo.sormas.org/sormas-rest",
  user_name = "SurvSup",
  password = "Lk5R7JXeZSEc"
)

tibble::as_tibble(disease_names)

OUTPUT

# A tibble: 65 × 2
   disease            active
   <chr>              <chr>
 1 AFP                TRUE
 2 CHOLERA            TRUE
 3 CONGENITAL_RUBELLA TRUE
 4 CSM                TRUE
 5 DENGUE             TRUE
 6 EVD                TRUE
 7 GUINEA_WORM        TRUE
 8 LASSA              TRUE
 9 MEASLES            TRUE
10 MONKEYPOX          TRUE
# ℹ 55 more rows

R

# import COVID-19 cases from SORMAS
covid_cases <- readepi::read_sormas(
  base_url = "https://demo.sormas.org/sormas-rest",
  user_name = "SurvSup",
  password = "Lk5R7JXeZSEc",
  disease = "coronavirus"
)

tibble::as_tibble(covid_cases)

OUTPUT

# A tibble: 5 × 16
  case_id    person_id date_onset date_admission case_origin case_status outcome
  <chr>      <chr>     <date>     <date>         <chr>       <chr>       <chr>
1 SZ3GHH-RJ… V2XMXK-K… NA         NA             IN_COUNTRY  NOT_CLASSI… NO_OUT…
2 W5C6VE-OH… SBWO4N-3… NA         NA             IN_COUNTRY  NOT_CLASSI… NO_OUT…
3 XBXV3A-TI… QXQ5VA-2… 2025-09-14 2025-09-14     IN_COUNTRY  CONFIRMED   NO_OUT…
4 SSTIVB-VS… ROTW6C-D… 2025-10-14 NA             IN_COUNTRY  NO_CASE     NO_OUT…
5 T6ZLGJ-MU… WON54L-6… NA         NA             IN_COUNTRY  NOT_CLASSI… NO_OUT…
# ℹ 9 more variables: sex <chr>, date_of_birth <chr>, country <chr>,
#   city <chr>, latitude <chr>, longitude <chr>, contact_id <chr>,
#   date_last_contact <date>, Ct_values <chr>

Key Points

Use {rio}, {io}, {readr}, and {ImportExport} to read data from individual files.
Use {readepi} to read data from HIS APIs and RDBMS.

Content from Clean case data

Last updated on 2025-11-11

Estimated time: 30 minutes

Overview

Questions

How to clean and standardize case data?

Objectives

Explain how to clean, curate, and standardize case data using the cleanepi package.
Perform essential data-cleaning operations on a raw case dataset.

Prerequisite

This episode requires you to:

Download the simulated_ebola_2.csv
Save it in the data/ folder. Follow instructions in Setup to configure an RStudio Project and folder

Introduction

In the process of analyzing outbreak data, it’s essential to ensure that the dataset is clean, curated, standardized, and validated. This will ensure that analysis is accurate (i.e. you are analysing what you think you are analysing) and reproducible (i.e. if someone wants to go back and repeat your analysis steps with your code, you can be confident they will get the same results). This episode focuses on cleaning epidemic and outbreak data using the cleanepi package. For demonstration purposes, we’ll work with a simulated dataset of Ebola cases.

Let’s start by loading the package rio to read data and the package cleanepi to clean it. We’ll use the pipe %>% to connect some of their functions, including others from the package dplyr, so let’s also load the tidyverse package:

R

# Load packages
library(tidyverse) # for {dplyr} functions and the pipe %>%
library(rio) # for importing data
library(here) # for easy file referencing
library(cleanepi)

The double-colon

The double-colon :: in R lets you call a specific function from a package without loading the entire package into the current environment.

For example, dplyr::filter(data, condition) uses filter() from the dplyr package.

This helps us remember which package each function comes from and avoids namespace conflicts.

The first step is to import the dataset into the working environment, following the guidelines outlined in the Read case data episode. This involves loading the dataset into the R environment and viewing its structure and content.

R

# Read data
# e.g.: if path to file is data/simulated_ebola_2.csv then:
raw_ebola_data <- rio::import(
  here::here("data", "simulated_ebola_2.csv")
) %>%
  dplyr::as_tibble() # for a simple data frame output

R

# Print data frame
raw_ebola_data

OUTPUT

# A tibble: 15,003 × 9
      V1 `case id` age     gender status `date onset` `date sample` lab   region
   <int>     <int> <chr>   <chr>  <chr>  <chr>        <chr>         <lgl> <chr>
 1     1     14905 90      1      "conf… 03/15/2015   06/04/2015    NA    valdr…
 2     2     13043 twenty… 2      ""     Sep /11/13   03/01/2014    NA    valdr…
 3     3     14364 54      f       <NA>  09/02/2014   03/03/2015    NA    valdr…
 4     4     14675 ninety  <NA>   ""     10/19/2014   31/ 12 /14    NA    valdr…
 5     5     12648 74      F      ""     08/06/2014   10/10/2016    NA    valdr…
 6     5     12648 74      F      ""     08/06/2014   10/10/2016    NA    valdr…
 7     6     14274 sevent… female ""     Apr /05/15   01/23/2016    NA    valdr…
 8     7     14132 sixteen male   "conf… Dec /29/Y    05/10/2015    NA    valdr…
 9     8     14715 44      f      "conf… Apr /06/Y    04/24/2016    NA    valdr…
10     9     13435 26      1      ""     09/07/2014   20/ 09 /14    NA    valdr…
# ℹ 14,993 more rows

Discussion

Let’s first diagnose the data frame. List all the characteristics in the data frame above that are problematic for data analysis.

Are any of those characteristics familiar from any previous data analysis you have performed?

Instructor Note

Lead a short discussion to relate the diagnosed characteristics with required cleaning operations.

You can use these terms to diagnose characteristics:

Codification, like sex and age entries using numbers, letters, and words. Also dates in different arrangement (“dd/mm/yyyy” or “yyyy/mm/dd”) and formats. Less visible, but also the column names.
Missing, how to interpret an entry like “” in status or “-99” in another column? Do we have a data dictionary from the data collection process?
Inconsistencies, like having a date of sample before the date of onset.
Non-plausible values, like outlier observations with dates outside of an expected timeframe.
Duplicates, are all observations unique?

You can use these terms to relate to cleaning operations:

Standardize column names
Standardize categorical variables like sex/gender
Standardize date columns
Convert from character to numeric values
Check the sequence of dated events

A quick inspection

Quick exploration and inspection of the dataset are crucial to identify potential data issues before diving into any analysis tasks. The cleanepi package simplifies this process with the scan_data() function. Let’s take a look at how you can use it:

R

cleanepi::scan_data(raw_ebola_data)

OUTPUT

  Field_names missing numeric   date character logical
1         age  0.0646  0.8348 0.0000    0.1006       0
2      gender  0.1578  0.0472 0.0000    0.7950       0
3      status  0.0535  0.0000 0.0000    0.9465       0
4  date onset  0.0001  0.0000 0.9159    0.0840       0
5 date sample  0.0001  0.0000 0.9999    0.0000       0
6      region  0.0000  0.0000 0.0000    1.0000       0

The results provide an overview of the content of every column, including column names, and the proportion of each data type per column. You can see that the column names in the dataset are descriptive but lack consistency, as some of them are composed of multiple words separated by white spaces. Additionally, some columns contain more than one data type, and there are missing values in others.
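To see what such proportions mean, the base R sketch below computes them by hand for a few age entries. It is an approximation written for illustration, not the implementation used inside cleanepi. The first five values come from the preview of raw_ebola_data; the NA was added so that every category is represented.

```r
# First entries of the age column, plus one missing value added for illustration
age <- c("90", "twenty-five", "54", "ninety", "74", NA)

# Values that parse as numbers; words become NA
parsed <- suppressWarnings(as.numeric(age))

# Share of entries in each category
prop_missing <- mean(is.na(age))
prop_numeric <- mean(!is.na(parsed))
prop_character <- mean(!is.na(age) & is.na(parsed))

c(missing = prop_missing, numeric = prop_numeric, character = prop_character)
```

In this toy vector half of the entries are numeric, a third are written in words, and the rest are missing, so the three proportions add up to one.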

Common operations

This section demonstrates how to perform some common data cleaning operations using the cleanepi package.

Standardizing column names

For this example dataset, standardizing column names typically involves removing spaces and connecting different words with “_”. This practice helps maintain consistency and readability in the dataset. However, the function used for standardizing column names offers more options. Type ?cleanepi::standardize_column_names for more details.

R

sim_ebola_data <- cleanepi::standardize_column_names(raw_ebola_data)
names(sim_ebola_data)

OUTPUT

[1] "v1"          "case_id"     "age"         "gender"      "status"
[6] "date_onset"  "date_sample" "lab"         "region"

If you want to maintain certain column names without subjecting them to the standardization process, you can utilize the keep argument of the function cleanepi::standardize_column_names(). This argument accepts a vector of column names that are intended to be kept unchanged.
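The effect of keeping some names untouched can be sketched in base R. This is a simplified approximation for illustration, not the logic of cleanepi::standardize_column_names(); the vectors raw_names and keep_names are assumptions that reuse column names from the raw dataset.

```r
# Some column names from the raw dataset
raw_names <- c("V1", "case id", "date onset", "date sample")

# Names that should be left exactly as they are
keep_names <- "V1"

# Replace runs of non-alphanumeric characters with "_" and lower the case
clean_names <- tolower(gsub("[^A-Za-z0-9]+", "_", trimws(raw_names)))

# Restore the names that were meant to be kept
clean_names[raw_names %in% keep_names] <- raw_names[raw_names %in% keep_names]
clean_names
```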

Challenge

What differences can you observe in the column names?
Standardize the column names of the input dataset, but keep the first column name as it is.

Give me a hint

You can try cleanepi::standardize_column_names(data = raw_ebola_data, keep = "V1")

Removing irregularities

Raw data may contain irregularities such as duplicated rows, empty rows and columns, or constant columns (where all entries have the same value). Functions from cleanepi like remove_duplicates() and remove_constants() remove such irregularities, as demonstrated in the code chunk below.

R

# Remove constants
sim_ebola_data <- cleanepi::remove_constants(sim_ebola_data)

Now, print the output to identify what constant column you removed!

R

# Remove duplicates
sim_ebola_data <- cleanepi::remove_duplicates(sim_ebola_data)

OUTPUT

! Found 5 duplicated rows in the dataset.
ℹ Use `attr(dat, "report")[["duplicated_rows"]]` to access them, where "dat" is
  the object used to store the output from this operation.

How many rows did you remove? Which rows were removed?

You can get the number and location of the duplicated rows that were found. Run cleanepi::print_report(), wait for the report to open in your browser, and find the “Duplicates” tab.

To use this information within R, you can print data frames with specific sections of the report in the console using the argument what.

R

# Print a report of found duplicates
cleanepi::print_report(data = sim_ebola_data, what = "found_duplicates")

# Print a report of removed duplicates
cleanepi::print_report(data = sim_ebola_data, what = "removed_duplicates")

Challenge

In the following data frame:

OUTPUT

# A tibble: 6 × 5
   col1  col2 col3  col4  col5
  <dbl> <dbl> <chr> <chr> <date>
1     1     1 a     b     NA
2     2     3 a     b     NA
3    NA    NA a     <NA>  NA
4    NA    NA a     <NA>  NA
5    NA    NA a     <NA>  NA
6    NA    NA <NA>  <NA>  NA

What columns or rows are:

duplicates?
empty?
constant?

Give me a hint

Duplicates mostly refers to replicated rows. Empty rows or columns can be a subset within the set of constant rows or columns.

Instructor Note

duplicated rows: 3, 4, 5
empty rows: 6
empty cols: 5
constant rows: 6
constant cols: 5

Point out to learners that the user can create new constant columns or rows after removing some initial ones.
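The data frame from the challenge can be rebuilt as follows, together with base R checks that reproduce the answers listed above. This is a sketch for illustration: it uses a base data frame named df rather than a tibble, and it treats a column as constant when all of its entries, including NA, are identical.

```r
# Rebuild the data frame shown in the challenge
df <- data.frame(
  col1 = c(1, 2, NA, NA, NA, NA),
  col2 = c(1, 3, NA, NA, NA, NA),
  col3 = c("a", "a", "a", "a", "a", NA),
  col4 = c("b", "b", NA, NA, NA, NA),
  col5 = as.Date(NA)
)

# Rows that have an identical copy elsewhere in the data frame
dup_rows <- which(duplicated(df) | duplicated(df, fromLast = TRUE))
dup_rows

# Rows in which every entry is missing
empty_rows <- unname(which(rowSums(!is.na(df)) == 0))
empty_rows

# Columns whose entries are all identical (NA counts as a value here)
is_constant <- vapply(df, function(x) length(unique(x)) == 1, logical(1))
constant_cols <- names(df)[is_constant]
constant_cols
```

Removing row 6 and col5 makes col3 constant, which is why a single pass is not enough and remove_constants() reports more than one iteration below.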

R

df %>%
  cleanepi::remove_constants()

OUTPUT

! Constant data was removed after 2 iterations.
ℹ Enter `attr(dat, "report")[["constant_data"]]` for more information, where
  "dat" represents the object used to store the output from
  `remove_constants()`.

OUTPUT

# A tibble: 2 × 2
   col1  col2
  <dbl> <dbl>
1     1     1
2     2     3

R

df %>%
  cleanepi::remove_constants() %>%
  cleanepi::remove_constants()

OUTPUT

! Constant data was removed after 2 iterations.
ℹ Enter `attr(dat, "report")[["constant_data"]]` for more information, where
  "dat" represents the object used to store the output from
  `remove_constants()`.

OUTPUT

# A tibble: 2 × 2
   col1  col2
  <dbl> <dbl>
1     1     1
2     2     3

Replacing missing values

In addition to the irregularities, raw data may contain missing values, and these may be encoded by different strings (e.g. "NA", "", character(0)). To ensure robust analysis, it is good practice to replace all missing values by NA in the entire dataset. Below is a code snippet demonstrating how you can achieve this in cleanepi for missing entries represented by an empty string "":

R

sim_ebola_data <- cleanepi::replace_missing_values(
  data = sim_ebola_data,
  na_strings = ""
)

sim_ebola_data

OUTPUT

# A tibble: 15,000 × 8
      v1 case_id age         gender status    date_onset date_sample row_id
   <int>   <int> <chr>       <chr>  <chr>     <chr>      <chr>        <int>
 1     1   14905 90          1      confirmed 03/15/2015 06/04/2015       1
 2     2   13043 twenty-five 2      <NA>      Sep /11/13 03/01/2014       2
 3     3   14364 54          f      <NA>      09/02/2014 03/03/2015       3
 4     4   14675 ninety      <NA>   <NA>      10/19/2014 31/ 12 /14       4
 5     5   12648 74          F      <NA>      08/06/2014 10/10/2016       5
 6     6   14274 seventy-six female <NA>      Apr /05/15 01/23/2016       7
 7     7   14132 sixteen     male   confirmed Dec /29/Y  05/10/2015       8
 8     8   14715 44          f      confirmed Apr /06/Y  04/24/2016       9
 9     9   13435 26          1      <NA>      09/07/2014 20/ 09 /14      10
10    10   14816 thirty      f      <NA>      06/29/2015 06/02/2015      11
# ℹ 14,990 more rows
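When a dataset encodes missing entries in more than one way, the same idea extends to several strings at once. The base R sketch below shows the underlying operation on a single vector; the values and the "-99" code are invented for illustration and are not taken from the simulated dataset.

```r
# Hypothetical status entries with two different missing-value codes
status <- c("confirmed", "", NA, "", "-99")

# Strings that should be read as missing
na_strings <- c("", "-99")

# Replace every matching entry with NA
status[status %in% na_strings] <- NA
status
```

Only "confirmed" remains as a real value; the other four entries are now NA.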

Validating subject IDs

Each entry in the dataset represents a subject (e.g. a disease case or study participant) and should be distinguishable by a specific column formatted in a particular way, such as falling within a specified range, containing certain prefixes and/or suffixes, or containing a specific number of characters. The cleanepi package offers the function check_subject_ids() designed precisely for this task, as shown in the code chunk below. This function checks whether the IDs are unique and meet the required criteria.

R

sim_ebola_data <-
  cleanepi::check_subject_ids(
    data = sim_ebola_data,
    target_columns = "case_id",
    range = c(0, 15000)
  )

OUTPUT

! Found 1957 duplicated values in the subject Ids.
ℹ Enter `attr(dat, "report")[["duplicated_rows"]]` to access them, where "dat"
  is the object used to store the output from this operation.
ℹ No incorrect subject id was detected.

Note that our simulated dataset does contain duplicated subject IDs.
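The two checks reported above, uniqueness and range, can be sketched in base R. This is a simplified illustration and not the logic of cleanepi::check_subject_ids(). The first six IDs come from the preview of the dataset; the repeated 12648 and the out-of-range 15231 were added for illustration.

```r
# Six IDs from the preview, plus one repeat and one out-of-range value
case_id <- c(14905, 13043, 14364, 14675, 12648, 14274, 12648, 15231)

# IDs that occur more than once
duplicated_ids <- unique(case_id[duplicated(case_id)])
duplicated_ids

# IDs outside the expected range of 0 to 15000
id_range <- c(0, 15000)
out_of_range <- case_id[case_id < id_range[1] | case_id > id_range[2]]
out_of_range
```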

How to correct the subject IDs?

Let’s print a preliminary report with cleanepi::print_report(sim_ebola_data). Focus on the “Unexpected subject ids” tab to identify what IDs require an extra treatment.

In the console, you can print:

R

cleanepi::print_report(data = sim_ebola_data, what = "incorrect_subject_id")

After finishing this tutorial, we invite you to explore the package reference guide of cleanepi::check_subject_ids() to find the function that can fix this situation.

Standardizing dates

An epidemic dataset typically contains date columns for different events, such as the date of infection, date of symptoms onset, etc. These dates can come in different date formats, and it is good practice to standardize them to ensure that subsequent analysis is comparing like-with-like. The cleanepi package provides functionality for converting date columns of epidemic datasets into ISO format, ensuring consistency across the different date columns. Here’s how you can use it on our simulated dataset:

R

sim_ebola_data <- cleanepi::standardize_dates(
  sim_ebola_data,
  target_columns = c(
    "date_onset",
    "date_sample"
  )
)

sim_ebola_data

OUTPUT

# A tibble: 15,000 × 8
      v1 case_id age         gender status    date_onset date_sample row_id
   <int> <chr>   <chr>       <chr>  <chr>     <date>     <date>       <int>
 1     1 14905   90          1      confirmed 2015-03-15 2015-06-04       1
 2     2 13043   twenty-five 2      <NA>      2013-09-11 2014-03-01       2
 3     3 14364   54          f      <NA>      2014-09-02 2015-03-03       3
 4     4 14675   ninety      <NA>   <NA>      2014-10-19 2031-12-14       4
 5     5 12648   74          F      <NA>      2014-08-06 2016-10-10       5
 6     6 14274   seventy-six female <NA>      2015-04-05 2016-01-23       7
 7     7 14132   sixteen     male   confirmed NA         2015-05-10       8
 8     8 14715   44          f      confirmed NA         2016-04-24       9
 9     9 13435   26          1      <NA>      2014-09-07 2020-09-14      10
10    10 14816   thirty      f      <NA>      2015-06-29 2015-06-02      11
# ℹ 14,990 more rows

This function converts the values in the target columns into the Ymd (ISO 8601) format. If target_columns = NULL, it automatically detects the date columns in the dataset and converts those instead.
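As an illustration of what standardizing means here, the following base R sketch tries a list of candidate formats on each value and keeps the first one that parses. The input vector and the list of formats are assumptions made for this example; the date guessing inside cleanepi is more elaborate.

```r
# Toy dates written in three different styles (hypothetical values)
raw_dates <- c("2015-03-15", "11/09/2013", "2014/09/02")

# Candidate formats, tried in this order (an assumption for this sketch)
candidate_formats <- c("%Y-%m-%d", "%d/%m/%Y", "%Y/%m/%d")

# Start with all values unparsed, then fill in what each format can read
parsed_dates <- rep(as.Date(NA), length(raw_dates))
for (fmt in candidate_formats) {
  still_missing <- is.na(parsed_dates)
  parsed_dates[still_missing] <- as.Date(raw_dates[still_missing], format = fmt)
}

parsed_dates
```

The order of the candidate formats matters: an ambiguous string such as "03/04/2015" is read by whichever matching format comes first, so it is worth checking a sample of converted values against the raw data.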

How is this possible?

We invite you to find the key package that makes this standardisation possible inside cleanepi by reading the Details section of the Standardize date variables reference manual!

Also, check how to use the orders argument if you want to target US format character strings. You can explore this reproducible example.

Converting to numeric values

In the raw dataset, some columns can contain a mixture of character and numeric values, and you will often want to convert numbers written as words into numeric values explicitly (e.g. "seven" to 7). For example, in our simulated dataset, some entries in the age column are written in words. In cleanepi, the function convert_to_numeric() performs this conversion, as illustrated in the code chunk below.

R

sim_ebola_data <- cleanepi::convert_to_numeric(sim_ebola_data,
  target_columns = "age"
)

sim_ebola_data

OUTPUT

# A tibble: 15,000 × 8
      v1 case_id   age gender status    date_onset date_sample row_id
   <int> <chr>   <dbl> <chr>  <chr>     <date>     <date>       <int>
 1     1 14905      90 1      confirmed 2015-03-15 2015-06-04       1
 2     2 13043      25 2      <NA>      2013-09-11 2014-03-01       2
 3     3 14364      54 f      <NA>      2014-09-02 2015-03-03       3
 4     4 14675      90 <NA>   <NA>      2014-10-19 2031-12-14       4
 5     5 12648      74 F      <NA>      2014-08-06 2016-10-10       5
 6     6 14274      76 female <NA>      2015-04-05 2016-01-23       7
 7     7 14132      16 male   confirmed NA         2015-05-10       8
 8     8 14715      44 f      confirmed NA         2016-04-24       9
 9     9 13435      26 1      <NA>      2014-09-07 2020-09-14      10
10    10 14816      30 f      <NA>      2015-06-29 2015-06-02      11
# ℹ 14,990 more rows
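The idea behind this conversion can be sketched in base R with a small lookup table. Both the `word_to_number` table and the input values are hypothetical and only cover this toy example; the package does not rely on such a hand-written table.

```r
# Toy age values mixing digits and words (hypothetical values)
age_raw <- c("90", "twenty-five", "54", "ninety")

# Hand-written lookup table, only for this sketch
word_to_number <- c("twenty-five" = 25, "ninety" = 90)

# Step 1: convert what already looks like a number; words become NA
age_numeric <- suppressWarnings(as.numeric(age_raw))

# Step 2: fill the gaps from the lookup table
is_word <- is.na(age_numeric) & !is.na(age_raw)
age_numeric[is_word] <- unname(word_to_number[age_raw[is_word]])

age_numeric
```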

Multiple language support

Thanks to the numberize package, we can convert numbers written as English, French or Spanish words to positive integer values!

In addition to common data cleansing tasks, such as those discussed in the above section, the cleanepi package offers additional functionalities tailored specifically for processing and analyzing outbreak and epidemic data. This section covers some of these specialized tasks.

Checking sequence of dated-events

Ensuring the correct order and sequence of dated events is crucial in epidemiological data analysis, especially when analyzing infectious diseases where the timing of events like symptom onset and sample collection is essential. The cleanepi package provides a helpful function called check_date_sequence() precisely for this purpose.

Here’s an example of a code chunk demonstrating the usage of the function check_date_sequence() on the first 100 records of our simulated Ebola dataset:

R

cleanepi::check_date_sequence(
  data = sim_ebola_data[1:100, ],
  target_columns = c("date_onset", "date_sample")
)

OUTPUT

! Detected 16 incorrect date sequences at lines: "10, 20, 22, 26, 29, 44, 46,
  54, 60, 63, 70, 71, 73, 80, 81, 90".
ℹ Enter `attr(dat, "report")[["incorrect_date_sequence"]]` to access them,
  where "dat" is the object used to store the output from this operation.

This functionality is crucial for ensuring data integrity and accuracy in epidemiological analyses, as it helps identify any inconsistencies or errors in the chronological order of events, allowing you to address them appropriately.
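The underlying check is a row-wise comparison of two date columns. A minimal base R sketch, using four rows copied from the printed table above and a data frame name (`events`) chosen for this example, could look like this:

```r
# Four rows taken from the printed output above
events <- data.frame(
  case_id = c("14905", "13043", "14132", "14816"),
  date_onset = as.Date(c("2015-03-15", "2013-09-11", NA, "2015-06-29")),
  date_sample = as.Date(c("2015-06-04", "2014-03-01", "2015-05-10", "2015-06-02")),
  stringsAsFactors = FALSE
)

# Rows where onset is later than sample collection.
# which() drops rows where the comparison is NA (missing onset date).
out_of_order <- which(events$date_onset > events$date_sample)

events[out_of_order, ]
```

Rows with a missing date cannot be checked, so they are neither flagged nor confirmed as correct; you may want to count them separately.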

Dictionary-based substitution

In the realm of data pre-processing, it’s common to encounter scenarios where certain columns in a dataset, such as the “gender” column in our simulated Ebola dataset, are expected to have specific values or factors. However, it’s also common for unexpected or erroneous values to appear in these columns, which need to be replaced with appropriate values. The cleanepi package offers support for dictionary-based substitution, a method that allows you to replace values in specific columns based on mappings defined in a dictionary. This approach ensures consistency and accuracy in data cleaning.

Moreover, cleanepi provides a built-in dictionary specifically tailored for epidemiological data. The example dictionary below includes mappings for the “gender” column.

R

test_dict <- base::readRDS(
  system.file("extdata", "test_dict.RDS", package = "cleanepi")
) %>%
  dplyr::as_tibble() # for a simple data frame output

test_dict

OUTPUT

# A tibble: 6 × 4
  options values grp    orders
  <chr>   <chr>  <chr>   <int>
1 1       male   gender      1
2 2       female gender      2
3 M       male   gender      3
4 F       female gender      4
5 m       male   gender      5
6 f       female gender      6

Now, we can use this dictionary to standardize the values of the “gender” column according to predefined categories. Below is an example code chunk demonstrating how to use this functionality:

R

sim_ebola_data <- cleanepi::clean_using_dictionary(
  sim_ebola_data,
  dictionary = test_dict
)

sim_ebola_data

OUTPUT

# A tibble: 15,000 × 8
      v1 case_id   age gender status    date_onset date_sample row_id
   <int> <chr>   <dbl> <chr>  <chr>     <date>     <date>       <int>
 1     1 14905      90 male   confirmed 2015-03-15 2015-06-04       1
 2     2 13043      25 female <NA>      2013-09-11 2014-03-01       2
 3     3 14364      54 female <NA>      2014-09-02 2015-03-03       3
 4     4 14675      90 <NA>   <NA>      2014-10-19 2031-12-14       4
 5     5 12648      74 female <NA>      2014-08-06 2016-10-10       5
 6     6 14274      76 female <NA>      2015-04-05 2016-01-23       7
 7     7 14132      16 male   confirmed NA         2015-05-10       8
 8     8 14715      44 female confirmed NA         2016-04-24       9
 9     9 13435      26 male   <NA>      2014-09-07 2020-09-14      10
10    10 14816      30 female <NA>      2015-06-29 2015-06-02      11
# ℹ 14,990 more rows

This approach simplifies the data cleaning process, ensuring that categorical data in epidemiological datasets is accurately categorized and ready for further analysis.
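Conceptually, dictionary-based substitution is a lookup: each raw value is matched against the options column and replaced by the corresponding entry in the values column. The base R sketch below illustrates this with a toy vector; the object names are chosen for this example, and it is not the package implementation.

```r
# Toy gender values as they might appear in raw data
gender_raw <- c("1", "2", "f", NA, "F", "female", "male")

# Dictionary with the same option/value pairs as test_dict above
dictionary <- data.frame(
  options = c("1", "2", "M", "F", "m", "f"),
  values = c("male", "female", "male", "female", "male", "female"),
  stringsAsFactors = FALSE
)

# Look up each raw value; values without a match give NA in `position`
position <- match(gender_raw, dictionary$options)

# Replace matched values and keep everything else unchanged
gender_clean <- ifelse(is.na(position), gender_raw, dictionary$values[position])

gender_clean
```

This toy version silently leaves unmatched values unchanged, which is convenient for a demonstration but can hide unexpected categories in real data.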

How to create your own data dictionary?

Note that, when the column in the dataset contains values that are not in the dictionary, the function cleanepi::clean_using_dictionary() will raise an error.

You can start a custom dictionary with a data frame inside or outside R. You can use the function cleanepi::add_to_dictionary() to include new elements in the dictionary. For example:

R

new_dictionary <- tibble::tibble(
  options = "0",
  values = "female",
  grp = "sex",
  orders = 1L
) %>%
  cleanepi::add_to_dictionary(
    option = "1",
    value = "male",
    grp = "sex",
    order = NULL
  )

new_dictionary

OUTPUT

# A tibble: 2 × 4
  options values grp   orders
  <chr>   <chr>  <chr>  <int>
1 0       female sex        1
2 1       male   sex        2

You can read more details in the section about “Dictionary-based data substituting” in the package “Get started” vignette.

Calculating time span between different date events

In epidemiological data analysis, it is also useful to track and analyze time-dependent events, such as the progression of a disease outbreak (i.e., the time difference between today and the first case reported) or the duration between sample collection and analysis (i.e., the time difference between today and the sample collection). The most common example is to calculate the age of all the subjects given their date of birth (i.e., the time difference between today and the date of birth).

The cleanepi package offers a convenient function for calculating the time elapsed between two dated events at different time scales. For example, the code snippet below uses the function cleanepi::timespan() to compute the time elapsed between each case’s date of sample collection and the 3rd of January 2025 ("2025-01-03").

R

sim_ebola_data <- cleanepi::timespan(
  sim_ebola_data,
  target_column = "date_sample",
  end_date = lubridate::ymd("2025-01-03"),
  span_unit = "years",
  span_column_name = "years_since_collection",
  span_remainder_unit = "months"
)

sim_ebola_data %>%
  dplyr::select(case_id, date_sample, years_since_collection, remainder_months)

OUTPUT

# A tibble: 15,000 × 4
   case_id date_sample years_since_collection remainder_months
   <chr>   <date>                       <dbl>            <dbl>
 1 14905   2015-06-04                       9                7
 2 13043   2014-03-01                      10               10
 3 14364   2015-03-03                       9               10
 4 14675   2031-12-14                      -6              -11
 5 12648   2016-10-10                       8                2
 6 14274   2016-01-23                       8               11
 7 14132   2015-05-10                       9                7
 8 14715   2016-04-24                       8                8
 9 13435   2020-09-14                       4                3
10 14816   2015-06-02                       9                7
# ℹ 14,990 more rows

After executing the function cleanepi::timespan(), two new columns named years_since_collection and remainder_months are added to the sim_ebola_data dataset, containing the calculated time elapsed since the date of sample collection for each case, measured in years, and the remaining time measured in months.
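To see where such numbers come from, here is a base R sketch that counts completed months between two dates and splits them into years and remaining months. It uses two sample dates from the output above. Because it counts only completed months, it is an assumption-laden approximation and can differ from the cleanepi::timespan() output for other rows.

```r
# Two sample collection dates from the output above
date_sample <- as.Date(c("2014-03-01", "2020-09-14"))
end_date <- as.Date("2025-01-03")

# POSIXlt exposes the year, month and day of month as separate fields
start_parts <- as.POSIXlt(date_sample)
end_parts <- as.POSIXlt(end_date)

# Completed months: subtract one if the end day is before the start day
total_months <- (end_parts$year - start_parts$year) * 12 +
  (end_parts$mon - start_parts$mon) -
  as.integer(end_parts$mday < start_parts$mday)

years_elapsed <- total_months %/% 12
months_remaining <- total_months %% 12

data.frame(date_sample, years_elapsed, months_remaining)
```

The sketch is only meant for end dates that fall after the start date; future dates, like the 2031 entry in the dataset, would need separate handling.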

Challenge

Age data is useful in any downstream analysis. You can categorize it to generate stratified estimates.

Read the test_df.RDS data frame within the cleanepi package:

R

dat <- readRDS(
  file = system.file("extdata", "test_df.RDS", package = "cleanepi")
) %>%
  dplyr::as_tibble()

Calculate the age of the subjects in years as of the 1st of March 2025, using their date of birth, along with the remainder time in months. Clean and standardize the required elements to get this done.

Give me a hint

Before calculating the age, you may need to:

standardize column names
standardize dates columns
replace missing values stored as strings with a valid missing entry (NA)

Show me the solution

In the solution we add date_first_pcr_positive_test given that it will provide the temporal scale for descriptive and statistical downstream analysis of the disease outbreak.

R

dat_clean <- dat %>%
  # standardize column names and dates
  cleanepi::standardize_column_names() %>%
  cleanepi::standardize_dates(
    target_columns = c("date_of_birth", "date_first_pcr_positive_test")
  ) %>%
  # replace from strings to a valid missing entry
  cleanepi::replace_missing_values(
    target_columns = "sex",
    na_strings = "-99"
  ) %>%
  # calculate the age in 'years' and return the remainder in 'months'
  cleanepi::timespan(
    target_column = "date_of_birth",
    end_date = lubridate::ymd("2025-03-01"),
    span_unit = "years",
    span_column_name = "age_in_years",
    span_remainder_unit = "months"
  )

OUTPUT

! Found <numeric> values that could also be of type <Date> in column:
  date_of_birth.
ℹ It is possible to convert them into <Date> using: `lubridate::as_date(x,
  origin = as.Date("1900-01-01"))`
• where "x" represents here the vector of values from these columns
  (`data$target_column`).

Now, how would you categorize a numerical variable?

Show me the solution

The simplest alternative is using Hmisc::cut2(). You can also use dplyr::case_when(); however, this requires more lines of code and is more appropriate for custom categorizations. Here we provide one solution using base::cut():

R

dat_clean %>%
  # select to conveniently view timespan output
  dplyr::select(
    study_id,
    sex,
    date_first_pcr_positive_test,
    date_of_birth,
    age_in_years
  ) %>%
  # categorize the age numerical variable
  dplyr::mutate(
    age_category = base::cut(
      x = age_in_years,
      breaks = c(0, 20, 35, 60, Inf), # replace with max value if known
      include.lowest = TRUE,
      right = FALSE
    )
  )

OUTPUT

# A tibble: 10 × 6
   study_id   sex date_first_pcr_posit…¹ date_of_birth age_in_years age_category
   <chr>    <int> <date>                 <date>               <dbl> <fct>
 1 PS001P2      1 2020-12-01             1972-06-01              52 [35,60)
 2 PS002P2      1 2021-01-01             1952-02-20              73 [60,Inf]
 3 PS004P2…    NA 2021-02-11             1961-06-15              63 [60,Inf]
 4 PS003P2      1 2021-02-01             1947-11-11              77 [60,Inf]
 5 P0005P2      2 2021-02-16             2000-09-26              24 [20,35)
 6 PS006P2      2 2021-05-02             NA                      NA <NA>
 7 PB500P2      1 2021-02-19             1989-11-03              35 [35,60)
 8 PS008P2      2 2021-09-20             1976-10-05              48 [35,60)
 9 PS010P2      1 2021-02-26             1991-09-23              33 [20,35)
10 PS011P2      2 2021-03-03             1991-02-08              34 [20,35)
# ℹ abbreviated name: ¹date_first_pcr_positive_test

You can investigate the maximum values of variables using skimr::skim(). Instead of base::cut() you can also use Hmisc::cut2(x = age_in_years, cuts = c(20, 35, 60)), which calculates the maximum value itself and does not require additional arguments.
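The arguments right = FALSE and include.lowest = TRUE decide which category a value on a boundary falls into. The short sketch below applies the same breaks to a few made-up ages that sit exactly on the boundaries:

```r
# Ages chosen to sit on or next to the category boundaries
ages <- c(19, 20, 35, 60, 73)

age_category <- cut(
  x = ages,
  breaks = c(0, 20, 35, 60, Inf),
  include.lowest = TRUE, # closes the last interval on the right
  right = FALSE          # intervals are closed on the left: [a, b)
)

data.frame(ages, age_category)
```

With right = FALSE, an age of exactly 20 belongs to [20,35) rather than [0,20), which matches the usual reading of "20 to 34 years".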

Multiple operations at once

Performing data cleaning operations individually can be time-consuming and error-prone. The cleanepi package simplifies this process by offering a convenient wrapper function called clean_data(), which allows you to perform multiple operations at once.

The clean_data() function applies a series of predefined data cleaning operations to the input dataset.

Alternatively, you can combine multiple data cleaning tasks with the pipe operator (%>%), as shown in the code snippet below, which starts from the raw simulated Ebola dataset.

R

# Perform the cleaning operations using the pipe (%>%) operator
cleaned_data <- raw_ebola_data %>%
  cleanepi::standardize_column_names() %>%
  cleanepi::remove_constants() %>%
  cleanepi::remove_duplicates() %>%
  cleanepi::replace_missing_values(na_strings = "") %>%
  cleanepi::check_subject_ids(
    target_columns = "case_id",
    range = c(1, 15000)
  ) %>%
  cleanepi::standardize_dates(
    target_columns = c("date_onset", "date_sample")
  ) %>%
  cleanepi::convert_to_numeric(target_columns = "age") %>%
  cleanepi::check_date_sequence(
    target_columns = c("date_onset", "date_sample")
  ) %>%
  cleanepi::clean_using_dictionary(dictionary = test_dict) %>%
  cleanepi::timespan(
    target_column = "date_sample",
    end_date = lubridate::ymd("2025-01-03"),
    span_unit = "years",
    span_column_name = "years_since_collection",
    span_remainder_unit = "months"
  )

Challenge

Have you noticed that cleanepi contains a set of functions to diagnose the cleaning status and another set to perform cleaning actions?

To identify both groups:

On a piece of paper, write the names of each function under the corresponding column:

Diagnose cleaning status	Perform cleaning action
…	…

Instructor Note

Notice that cleanepi contains a set of functions to diagnose the cleaning status (e.g., check_subject_ids() and check_date_sequence() in the chunk above) and another set to perform a cleaning action (the complementary functions from the chunk above).

Cleaning report

The cleanepi package generates a comprehensive report detailing the findings and actions of all data cleansing operations conducted during the analysis. This report is presented as a webpage with multiple sections. Each section corresponds to a specific data cleansing operation, and clicking on each section allows you to access the results of that particular operation. This interactive approach enables users to efficiently review and analyze the outcomes of individual cleansing steps within the broader data cleansing process.

You can view the report using:

R

cleanepi::print_report(data = cleaned_data)

Data cleaning report — Example of data cleaning report generated by cleanepi

Key Points

Use cleanepi package to clean and standardize epidemic and outbreak data
Understand how to use cleanepi to perform common data cleansing tasks and epidemiology related operations
View the data cleaning report in a browser, consult it and make decisions.

Content from Validate case data

Last updated on 2025-11-11

Estimated time: 12 minutes

Overview

Questions

How to convert a raw dataset into a linelist object?

Objectives

Demonstrate how to convert case data to linelist data
Demonstrate how to tag and validate data to make analysis more reliable

Prerequisite

This episode requires you to:

Download the cleaned_data.csv
Save it in the data/ folder.

Introduction

In outbreak analysis, once you have completed the initial steps of reading and cleaning the case data, it’s essential to establish an additional foundation layer to ensure the integrity and reliability of subsequent analyses. Otherwise you might find that your analysis suddenly stops working when specific variables appear or disappear, or their underlying data types (like <date> or <chr>) change. Specifically, this additional layer involves: 1) verifying the presence and correct data type of certain columns within your dataset, a process commonly referred to as “tagging”; 2) implementing measures to check that these tagged columns are not inadvertently deleted during further data processing steps, known as “validation”.

This episode focuses on tagging and validating outbreak data using the linelist package. Let’s start by loading the package rio to read data and the package linelist to create a linelist object. We’ll use the pipe %>% to connect some of their functions, including others from the package dplyr, so let’s also load the tidyverse package:

R

# Load packages
library(tidyverse) # for {dplyr} functions and the pipe %>%
library(rio) # for importing data
library(here) # for easy file referencing
library(linelist) # for tagging and validating

The double-colon

The double-colon :: in R lets you call a specific function from a package without loading the entire package into the current environment.

For example, dplyr::filter(data, condition) uses filter() from the dplyr package.

This helps us remember package functions and avoid namespace conflicts.

Import the dataset following the guidelines outlined in the Read case data episode. This involves loading the dataset into the working environment and viewing its structure and content.

R

# Read data
# e.g.: if path to file is data/cleaned_data.csv then:
cleaned_data <- rio::import(
  here::here("data", "cleaned_data.csv")
) %>%
  dplyr::as_tibble() # for a simple data frame output

OUTPUT

# A tibble: 15,000 × 10
      v1 case_id   age gender status    date_onset date_sample row_id
   <int>   <int> <dbl> <chr>  <chr>     <IDate>    <IDate>      <int>
 1     1   14905    90 male   confirmed 2015-03-15 2015-06-04       1
 2     2   13043    25 female <NA>      2013-09-11 2014-03-01       2
 3     3   14364    54 female <NA>      2014-09-02 2015-03-03       3
 4     4   14675    90 <NA>   <NA>      2014-10-19 2031-12-14       4
 5     5   12648    74 female <NA>      2014-08-06 2016-10-10       5
 6     6   14274    76 female <NA>      2015-04-05 2016-01-23       7
 7     7   14132    16 male   confirmed NA         2015-05-10       8
 8     8   14715    44 female confirmed NA         2016-04-24       9
 9     9   13435    26 male   <NA>      2014-09-07 2020-09-14      10
10    10   14816    30 female <NA>      2015-06-29 2015-06-02      11
# ℹ 14,990 more rows
# ℹ 2 more variables: years_since_collection <int>, remainder_months <int>

Discussion

An unexpected change

You are in an emergency response situation. You need to generate daily situation reports. You automated your analysis to read data directly from the online server 😁. However, the people in charge of the data collection/administration needed to remove/rename/reformat one variable you found helpful 😞!

How can you detect if the data input is still valid to replicate the analysis code you wrote the day before?

Instructor Note

If learners do not have an experience to share, we as instructors can share one.

A scenario like this usually happens when the institution doing the analysis is not the same as the institution collecting the data. The latter can make decisions about the data structure that can affect downstream processes, impacting the time or the accuracy of the analysis results.

Creating a linelist and tagging elements

Once the data is loaded and cleaned, we convert the cleaned case data into a linelist object using the linelist package, as in the code chunk below.

R

# Create a linelist object from cleaned data
linelist_data <- linelist::make_linelist(
  x = cleaned_data,         # Input data
  id = "case_id",            # Column for unique case identifiers
  date_onset = "date_onset", # Column for date of symptom onset
  gender = "gender"          # Column for gender
)

# Display the resulting linelist object
linelist_data

OUTPUT


// linelist object
# A tibble: 15,000 × 10
      v1 case_id   age gender status    date_onset date_sample row_id
   <int>   <int> <dbl> <chr>  <chr>     <IDate>    <IDate>      <int>
 1     1   14905    90 male   confirmed 2015-03-15 2015-06-04       1
 2     2   13043    25 female <NA>      2013-09-11 2014-03-01       2
 3     3   14364    54 female <NA>      2014-09-02 2015-03-03       3
 4     4   14675    90 <NA>   <NA>      2014-10-19 2031-12-14       4
 5     5   12648    74 female <NA>      2014-08-06 2016-10-10       5
 6     6   14274    76 female <NA>      2015-04-05 2016-01-23       7
 7     7   14132    16 male   confirmed NA         2015-05-10       8
 8     8   14715    44 female confirmed NA         2016-04-24       9
 9     9   13435    26 male   <NA>      2014-09-07 2020-09-14      10
10    10   14816    30 female <NA>      2015-06-29 2015-06-02      11
# ℹ 14,990 more rows
# ℹ 2 more variables: years_since_collection <int>, remainder_months <int>

// tags: id:case_id, date_onset:date_onset, gender:gender

The linelist package supplies tags for common epidemiological variables, along with a set of acceptable data types for each. You can view the available tag names and their acceptable data types using linelist::tags_types().

Challenge

Let’s tag more variables. In new datasets, variable names will often differ from the available tag names. However, we can associate them based on how the variables were defined for data collection.

Now:

Explore the available tag names in {linelist}.
Find what other variables in the cleaned dataset can be associated with any of these available tags.
Tag those variables as above using linelist::make_linelist().

Give me a hint

You can access the list of available tag names in {linelist} using:

R

# Get a list of available tags by name and data types
linelist::tags_types()

# Get a list of names only
linelist::tags_names()

Show me the solution

R

linelist::make_linelist(
  x = cleaned_data,
  id = "case_id",
  date_onset = "date_onset",
  gender = "gender",
  age = "age", # same name in default list and dataset
  date_reporting = "date_sample" # different names but related
)

How are these additional tags visible in the output?

Validation

To ensure that all tagged variables are standardized and have the correct data types, use linelist::validate_linelist(), as shown in the example below:

R

linelist::validate_linelist(linelist_data)

Challenge

Let’s validate some tagged variables. Let’s simulate a situation in an ongoing outbreak. You wake up one day to discover that the data stream you rely on has a new set of entries (i.e., rows or observations) and one variable whose data type has changed.

For example, let’s assume the variable age changed from a double (<dbl>) variable to character (<chr>).

To simulate this situation:

Change the variable data type,
Tag the variable into a linelist, and then
Validate it.

Describe how linelist::validate_linelist() reacts when input data has a different variable data type.

Give me a hint

We can use dplyr::mutate() to change the variable type before tagging for validation. For example:

R

cleaned_data %>%
  # simulate a change of data type in one variable
  dplyr::mutate(age = as.character(age)) %>%
  # tag one variable
  linelist::... %>%
  # validate the linelist
  linelist::...

Give me a hint

Please run the code line by line, focusing only on the parts before the pipe (%>%). After each step, observe the output before moving to the next line.

If the age variable changes from double (<dbl>) to character (<chr>) we get the following:

R

cleaned_data %>%
  # simulate a change of data type in one variable
  dplyr::mutate(age = as.character(age)) %>%
  # tag one variable
  linelist::make_linelist(
    age = "age"
  ) %>%
  # validate the linelist
  linelist::validate_linelist()

ERROR

Error: Some tags have the wrong class:
  - age: Must inherit from class 'numeric'/'integer', but has class 'character'

Why are we getting an Error message?

Explore other situations to understand this behavior. Let’s try these additional changes to variables:

date_onset changes from a <date> variable to character (<chr>),
gender changes from a character (<chr>) variable to integer (<int>).

Then tag them into a linelist for validation. Does the Error message suggest a solution?

Show me the solution

R

# Change 2
# Run this code line by line to identify changes
cleaned_data %>%
  # simulate a change of data type
  dplyr::mutate(date_onset = as.character(date_onset)) %>%
  # tag
  linelist::make_linelist(
    date_onset = "date_onset"
  ) %>%
  # validate
  linelist::validate_linelist()

R

# Change 3
# Run this code line by line to identify changes
cleaned_data %>%
  # simulate a change of data type
  dplyr::mutate(gender = as.factor(gender)) %>%
  dplyr::mutate(gender = as.integer(gender)) %>%
  # tag
  linelist::make_linelist(
    gender = "gender"
  ) %>%
  # validate
  linelist::validate_linelist()

We get Error messages because of the mismatch between the predefined tag type (from linelist::tags_types()) and the tagged variable class in the linelist.

The Error message informs us that in order to validate our linelist, we must fix the input variable type to fit the expected tag type. In a data analysis script, we can do this by adding one cleaning step to the pipeline.
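The logic of such a class check, and of the cleaning step that resolves it, can be sketched in base R. Everything below is a toy illustration: the function `check_tag_classes()` and the `expected_classes` list are hypothetical names written for this example and are not part of {linelist}.

```r
# Accepted classes per tag (illustrative values for this sketch only)
expected_classes <- list(
  age = c("numeric", "integer"),
  gender = c("character", "factor")
)

# Hypothetical helper: report tags whose column has an unexpected class
check_tag_classes <- function(data, tags, expected) {
  problems <- character(0)
  for (tag in names(tags)) {
    actual <- class(data[[tags[[tag]]]])
    if (!any(actual %in% expected[[tag]])) {
      problems <- c(problems, sprintf(
        "%s: expected %s but found %s",
        tag,
        paste(expected[[tag]], collapse = "/"),
        paste(actual, collapse = "/")
      ))
    }
  }
  problems
}

# Toy data where age arrived as character
cases <- data.frame(
  age = c("90", "25"),
  gender = c("male", "female"),
  stringsAsFactors = FALSE
)
tags <- c(age = "age", gender = "gender")

problems_before <- check_tag_classes(cases, tags, expected_classes)

# The extra cleaning step: restore the expected type, then check again
cases$age <- as.numeric(cases$age)
problems_after <- check_tag_classes(cases, tags, expected_classes)

problems_before
problems_after
```

Placing the type conversion before the check mirrors the order in a real script: clean first, then tag and validate.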

Challenge

Which step in the linelist workflow of tagging and validating would respond to the absence of a variable?

Show me the solution

About losing variables, you can simulate this scenario:

R

cleaned_data %>%
  # simulate the loss of one variable
  dplyr::select(-age) %>%
  # tag one variable
  linelist::make_linelist(
    age = "age"
  )

ERROR

Error in base::tryCatch(base::withCallingHandlers({: 1 assertions failed:
 * Variable 'tag': Must be element of set
 * {'v1','case_id','gender','status','date_onset','date_sample','row_id','years_since_collection','remainder_months'},
 * but is 'age'.

Safeguarding

Safeguarding is implicitly built into the linelist objects. If you try to drop any of the tagged columns, you will receive an error or warning message, as shown in the example below.

R

new_df <- linelist_data %>%
  dplyr::select(case_id, gender)

WARNING

Warning: The following tags have lost their variable:
 date_onset:date_onset

The Warning message above is the default output when we lose tags in a linelist object. However, it can be changed to an Error message using linelist::lost_tags_action().

Challenge

Let’s test the implications of changing the safeguarding configuration from a Warning to an Error message.

First, run this code to count the frequency per category within a categorical variable:

R

linelist_data %>%
  dplyr::select(case_id, gender) %>%
  dplyr::count(gender)

Set behavior for lost tags in a linelist to “error” as follows:

R

# set behavior to "error"
linelist::lost_tags_action(action = "error")

Now, re-run the above code segment with dplyr::count().

Identify:

What is the difference in the output between a Warning and an Error?
What could be the implications of this change for your daily data analysis pipeline during an outbreak response?

Show me the solution

Deciding between Warning or Error message will depend on the level of attention or flexibility you need when losing tags. One will alert you about a change but will continue running the code downstream. The other will stop your analysis pipeline and the rest will not be executed.

A data reading, cleaning and validation script may require a more stable or fixed pipeline. An exploratory data analysis may require a more flexible approach. These two processes can be isolated in different scripts or repositories to adjust the safeguarding according to your needs.
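The practical difference between the two settings can be reproduced with base R conditions. In the sketch below, `lose_tag_step()` is a hypothetical stand-in for a pipeline step that drops a tagged column; it does not call {linelist}.

```r
# Hypothetical stand-in for a pipeline step that loses a tagged column.
# `action` mimics the two safeguarding settings discussed above.
lose_tag_step <- function(action = c("warning", "error")) {
  action <- match.arg(action)
  msg <- "The following tags have lost their variable: date_onset:date_onset"
  if (action == "error") {
    stop(msg)
  }
  warning(msg)
  "downstream result" # only reached when the step does not stop
}

# With a warning, the message is raised and the code keeps running
result_warning <- withCallingHandlers(
  lose_tag_step("warning"),
  warning = function(w) {
    message("Warning caught: ", conditionMessage(w))
    invokeRestart("muffleWarning")
  }
)

# With an error, execution stops and no downstream result is produced
result_error <- tryCatch(
  lose_tag_step("error"),
  error = function(e) {
    message("Error caught: ", conditionMessage(e))
    NULL
  }
)

result_warning
result_error
```

In an automated daily report, the error setting makes a silent structural change impossible to miss, at the cost of producing no report until the script is fixed.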

Before you continue, set the configuration back again to the default option of Warning:

R

# set behavior to the default option: "warning"
linelist::lost_tags_action()

OUTPUT

Lost tags will now issue a warning.

A linelist object resembles a data frame but offers richer features and functionalities. Packages that are linelist-aware can leverage these features. For example, you can extract a data frame of only the tagged columns using the linelist::tags_df() function, as shown below:

R

linelist::tags_df(linelist_data)

OUTPUT

# A tibble: 15,000 × 3
      id date_onset gender
   <int> <IDate>    <chr>
 1 14905 2015-03-15 male
 2 13043 2013-09-11 female
 3 14364 2014-09-02 female
 4 14675 2014-10-19 <NA>
 5 12648 2014-08-06 female
 6 14274 2015-04-05 female
 7 14132 NA         male
 8 14715 NA         female
 9 13435 2014-09-07 male
10 14816 2015-06-29 female
# ℹ 14,990 more rows

This allows you to use only the tagged columns in downstream analysis, which will be useful for the next episode!
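The idea of keeping only the tagged columns, renamed after their tags, can be illustrated with a plain data frame and a named character vector. This is a conceptual sketch with made-up object names, not the internals of linelist::tags_df().

```r
# Toy case data with one untagged column (age)
cases <- data.frame(
  case_id = c(14905L, 13043L),
  age = c(90, 25),
  gender = c("male", "female"),
  date_onset = as.Date(c("2015-03-15", "2013-09-11")),
  stringsAsFactors = FALSE
)

# Tag name on the left, column name in the data on the right
tags <- c(id = "case_id", date_onset = "date_onset", gender = "gender")

# Select the tagged columns and rename them after their tags
tagged_only <- stats::setNames(cases[unname(tags)], names(tags))

tagged_only
```

Because the result always uses the tag names, downstream code can refer to `id` or `date_onset` regardless of what the columns were called in the original dataset.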

When should I use `{linelist}`?

Data analysis during an outbreak response or mass-gathering surveillance demands a different set of “data safeguards” compared to usual research situations. For example, your data will change or be updated over time (e.g. new entries, new variables, renamed variables).

linelist is more appropriate for this type of ongoing or long-lasting analysis. Check the “Get started” vignette section about When you should consider using {linelist}? for more information.

Key Points

Use linelist package to tag, validate, and prepare case data for downstream analysis.

Content from Aggregate and visualize

Last updated on 2025-11-11

Estimated time: 30 minutes

Overview

Questions

How to aggregate and summarise case data?
How to visualize aggregated data?
What is the distribution of cases in time, place, gender, age?

Objectives

Simulate synthetic outbreak data
Convert individual linelist data to incidence over time
Create epidemic curves from incidence data

Introduction

In an analytic pipeline, exploratory data analysis (EDA) is an important step before formal modelling. EDA helps determine relationships between variables and summarize their main characteristics, often by means of data visualization.

This episode focuses on EDA of outbreak data using R packages. A key aspect of EDA in epidemic analysis is ‘person, place and time’. It is useful to identify how observed events - such as confirmed cases, hospitalizations, deaths, and recoveries - change over time, and how these vary across different locations and demographic factors, including gender, age, and more.

Let’s start by loading the package incidence2 to aggregate linelist data according to specific characteristics and to visualize the resulting epidemic curves (epicurves), which plot the number of new events (i.e. incidence) over time. We’ll use simulist to simulate some outbreak data to analyse, and tracetheme for figure formatting. We’ll use the pipe %>% to connect some of their functions, including others from the packages dplyr and ggplot2, so let’s also load the tidyverse package:

R

# Load packages
library(incidence2) # For aggregating and visualising
library(simulist) # For simulating linelist data
library(tracetheme) # For formatting figures
library(tidyverse) # For {dplyr} and {ggplot2} functions and the pipe %>%

The double-colon

The double-colon :: in R lets you call a specific function from a package without loading the entire package into the current environment.

For example, dplyr::filter(data, condition) uses filter() from the dplyr package. This helps us remember which package a function comes from and avoids namespace conflicts.
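As a small illustration of the point above, the sketch below calls dplyr::filter() without attaching dplyr. The data frame toy_ages is hypothetical; the only assumption is that dplyr is installed.

```r
# Hypothetical toy data for illustration
toy_ages <- data.frame(
  id = 1:4,
  age = c(12, 37, 53, 80)
)

# Without library(dplyr), a bare filter() would resolve to stats::filter(),
# which does something entirely different (linear filtering of time series).
# The double-colon makes the intended function explicit.
toy_adults <- dplyr::filter(toy_ages, age >= 18)

nrow(toy_adults)
```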

Synthetic outbreak data

To illustrate the process of conducting EDA on outbreak data, we will generate a line list for a hypothetical disease outbreak using the simulist package. simulist generates simulated outbreak data according to a given configuration. Its minimal configuration can generate a linelist, as shown in the code chunk below:

R

# Simulate linelist data for an outbreak with size between 1000 and 1500
set.seed(1) # Set seed for reproducibility
sim_data <- simulist::sim_linelist(outbreak_size = c(1000, 1500)) %>%
  dplyr::as_tibble() # for a simple data frame output

WARNING

Warning: Number of cases exceeds maximum outbreak size.
Returning data early with 1546 cases and 3059 total contacts (including cases).

R

# Display the simulated dataset
sim_data

OUTPUT

# A tibble: 1,546 × 13
      id case_name           case_type sex     age date_onset date_reporting
   <int> <chr>               <chr>     <chr> <int> <date>     <date>
 1     1 Zahra al-Masri      probable  f        37 2023-01-01 2023-01-01
 2     3 Waleeda al-Muhammad probable  f        12 2023-01-11 2023-01-11
 3     6 Rhett Jackson       confirmed m        53 2023-01-18 2023-01-18
 4     8 Sunnique Sims       confirmed f        36 2023-01-23 2023-01-23
 5    11 Danielle Griggs     probable  f        77 2023-01-30 2023-01-30
 6    14 Mohamed Parker      probable  m        37 2023-01-24 2023-01-24
 7    15 Melissa Eriacho     probable  f        67 2023-01-31 2023-01-31
 8    16 Maria Laughlin      probable  f        80 2023-01-30 2023-01-30
 9    20 Phillip Park        confirmed m        70 2023-01-27 2023-01-27
10    21 Dewarren Newton     probable  m        87 2023-02-09 2023-02-09
# ℹ 1,536 more rows
# ℹ 6 more variables: date_admission <date>, outcome <chr>,
#   date_outcome <date>, date_first_contact <date>, date_last_contact <date>,
#   ct_value <dbl>

This linelist dataset has entries on individual-level simulated events during the outbreak.

Additional Resources on Outbreak Data

The above is the default configuration of simulist, so it includes a number of assumptions about the transmissibility and severity of the pathogen. If you want to know more about sim_linelist() and other functionalities, check the documentation website.

You can also find data sets from real past emergencies in the {outbreaks} R package.

Aggregating

Often we want to analyse and visualise the number of events that occur on a particular day or week, rather than focusing on individual cases. This requires grouping linelist data into incidence data. The incidence2 package offers a useful function called incidence2::incidence() for grouping case data, usually based around dated events and/or other characteristics. The code chunk below demonstrates the creation of an <incidence2> class object from the simulated linelist data based on the date of onset.

R

# Create an incidence object by aggregating case data based on the date of onset
daily_incidence <- incidence2::incidence(
  sim_data,
  date_index = "date_onset",
  interval = "day" # Aggregate by daily intervals
)

# View the incidence data
daily_incidence

OUTPUT

# incidence:  232 x 3
# count vars: date_onset
   date_index count_variable count
   <date>     <chr>          <int>
 1 2023-01-01 date_onset         1
 2 2023-01-11 date_onset         1
 3 2023-01-18 date_onset         1
 4 2023-01-23 date_onset         1
 5 2023-01-24 date_onset         1
 6 2023-01-27 date_onset         2
 7 2023-01-29 date_onset         1
 8 2023-01-30 date_onset         2
 9 2023-01-31 date_onset         2
10 2023-02-01 date_onset         1
# ℹ 222 more rows
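To see what this aggregation does on a small scale, here is a minimal sketch using a hypothetical six-row data frame (toy_linelist is not part of the lesson data). Each row is one case, and incidence() counts the rows per onset date.

```r
library(incidence2)

# Hypothetical toy linelist: one row per case
toy_linelist <- data.frame(
  id = 1:6,
  date_onset = as.Date(c(
    "2023-01-01", "2023-01-01", "2023-01-02",
    "2023-01-04", "2023-01-04", "2023-01-04"
  ))
)

# Aggregate to daily counts of onset dates
toy_incidence <- incidence2::incidence(
  toy_linelist,
  date_index = "date_onset",
  interval = "day"
)

toy_incidence
```

Note that 2023-01-03 has no cases in this toy data, so with the default settings it does not appear as a row in the result.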

With the incidence2 package, you can specify the desired interval (e.g. day, week) and categorize cases by one or more factors. Below is a code snippet demonstrating weekly cases grouped by the date of onset, sex, and type of case.

R

# Group incidence data by week, accounting for sex and case type
weekly_incidence <- incidence2::incidence(
  sim_data,
  date_index = "date_onset",
  interval = "week", # Aggregate by weekly intervals
  groups = c("sex", "case_type") # Group by sex and case type
)

# View the incidence data
weekly_incidence

OUTPUT

# incidence:  201 x 5
# count vars: date_onset
# groups:     sex, case_type
   date_index sex   case_type count_variable count
   <isowk>    <chr> <chr>     <chr>          <int>
 1 2022-W52   f     probable  date_onset         1
 2 2023-W02   f     probable  date_onset         1
 3 2023-W03   m     confirmed date_onset         1
 4 2023-W04   f     confirmed date_onset         2
 5 2023-W04   f     probable  date_onset         1
 6 2023-W04   m     confirmed date_onset         1
 7 2023-W04   m     probable  date_onset         1
 8 2023-W05   f     confirmed date_onset         3
 9 2023-W05   f     probable  date_onset         4
10 2023-W05   f     suspected date_onset         1
# ℹ 191 more rows

Dates Completion

When cases are grouped by different factors, it’s possible that the events involving these groups may have different date ranges in the resulting incidence2 object. The incidence2 package provides a function called complete_dates() to ensure that an incidence object has the same range of dates for each group. By default, missing counts for a particular group will be filled with 0 for that date.

This functionality is also available as an argument within incidence2::incidence(), by adding complete_dates = TRUE.

R

# Create an incidence object grouped by sex, aggregating daily
daily_incidence_2 <- incidence2::incidence(
  sim_data,
  date_index = "date_onset",
  groups = "sex",
  interval = "day", # Aggregate by daily intervals
  complete_dates = TRUE # Complete missing dates in the incidence object
)
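The standalone complete_dates() function described above can be sketched on a small example. The data frame toy_grouped is hypothetical, and the sketch relies on the default arguments of complete_dates().

```r
library(incidence2)

# Hypothetical toy linelist where the two groups cover different dates
toy_grouped <- data.frame(
  date_onset = as.Date(c("2023-01-01", "2023-01-03", "2023-01-02")),
  sex = c("f", "f", "m")
)

# Grouped daily incidence: only observed date-group combinations are present
toy_inc <- incidence2::incidence(
  toy_grouped,
  date_index = "date_onset",
  groups = "sex",
  interval = "day"
)

# Fill in the missing date-group combinations with counts of 0
toy_completed <- incidence2::complete_dates(toy_inc)

nrow(toy_inc)       # observed combinations only
nrow(toy_completed) # every date in the range, for each group
```

The total number of cases is unchanged; only zero-count rows are added so that every group spans the same range of dates.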

Challenge 1: Can you do it?

Task: Aggregate sim_data linelist based on admission date and case outcome in biweekly intervals, and save the results in an object called biweekly_incidence.

Visualization

The incidence2 object can be visualized using the plot() function from the base R package. The resulting graph is referred to as an epidemic curve, or epi-curve for short. The following code snippets generate epi-curves for the daily_incidence and weekly_incidence incidence objects mentioned above.

R

# Plot daily incidence data
base::plot(daily_incidence) +
  ggplot2::labs(
    x = "Time (in days)", # x-axis label
    y = "Daily cases" # y-axis label
  ) +
  tracetheme::theme_trace() # Apply the custom trace theme

R

# Plot weekly incidence data
base::plot(weekly_incidence) +
  ggplot2::labs(
    x = "Time (in weeks)", # x-axis label
    y = "Weekly cases" # y-axis label
  ) +
  tracetheme::theme_trace() # Apply the custom trace theme

Easy aesthetics

We invite you to skim the incidence2 package “Get started” vignette. Find how you can use arguments within plot() to provide aesthetics to your incidence2 class objects.

R

base::plot(weekly_incidence, fill = "sex")

Some of them include show_cases = TRUE, angle = 45, and n_breaks = 5. Feel free to give them a try.
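As a hedged illustration of how these arguments can be combined, the sketch below builds a small weekly incidence object from hypothetical data (toy_onsets is not part of the lesson) and passes the arguments named above to plot().

```r
library(incidence2)

# Hypothetical toy linelist spanning a few weeks
toy_onsets <- data.frame(
  date_onset = as.Date("2023-01-02") + c(0, 1, 8, 9, 9, 15, 16, 22),
  sex = c("f", "m", "f", "f", "m", "m", "f", "m")
)

# Weekly incidence grouped by sex
toy_weekly <- incidence2::incidence(
  toy_onsets,
  date_index = "date_onset",
  groups = "sex",
  interval = "week"
)

# Combine several plot() arguments
toy_plot <- base::plot(
  toy_weekly,
  fill = "sex",      # colour the bars by sex
  show_cases = TRUE, # draw each case as a separate box
  angle = 45,        # rotate the x-axis labels
  n_breaks = 5       # approximate number of x-axis breaks
)

toy_plot
```

Showing individual cases tends to be readable only for small outbreaks, so it may be less useful for data sets as large as sim_data.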

Challenge 2: Can you do it?

Task: Visualize biweekly_incidence object.

Curve of cumulative cases

The cumulative number of cases can be calculated by applying the cumulate() function to an incidence2 object, and then visualized, as in the example below.

R

# Calculate cumulative incidence
cum_df <- incidence2::cumulate(daily_incidence)

# Plot cumulative incidence data
base::plot(cum_df) +
  ggplot2::labs(
    x = "Time (in days)", # x-axis label
    y = "Cumulative cases" # y-axis label
  ) +
  tracetheme::theme_trace() # Apply the custom trace theme

Note that this function preserves grouping, i.e., if the incidence2 object contains groups, it will accumulate the cases accordingly.

Challenge 3: Can you do it?

Task: Visualize the cumulative cases from the biweekly_incidence object.

Peak estimation

You can estimate the peak (the time with the highest number of recorded cases) using the estimate_peak() function from the incidence2 package. This function uses a bootstrapping method to determine the peak time: it resamples dates with replacement, which yields a distribution of estimated peak times.

R

# Estimate the peak of the daily incidence data
peak <- incidence2::estimate_peak(
  daily_incidence,
  n = 100, # Number of bootstrap samples for the peak estimation
  alpha = 0.05, # Significance level for the confidence interval
  first_only = TRUE, # Return only the first peak found
  progress = FALSE # Disable progress messages
)

# Display the estimated peak
print(peak)

OUTPUT

# A tibble: 1 × 7
  count_variable observed_peak observed_count bootstrap_peaks lower_ci
  <chr>          <date>                 <int> <list>          <date>
1 date_onset     2023-05-01                22 <df [100 × 1]>  2023-03-26
# ℹ 2 more variables: median <date>, upper_ci <date>

This example demonstrates how to estimate the peak time with the estimate_peak() function, using 100 bootstrap samples and a \(95\%\) confidence interval.
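The bootstrapping idea can be sketched in base R. This is a conceptual illustration on hypothetical data, not the internal implementation of estimate_peak(); the helper peak_date() and the vector toy_onset_dates are invented for this example.

```r
set.seed(42) # for reproducibility

# Hypothetical onset dates, one per case
toy_onset_dates <- as.Date("2023-01-01") + c(0, 1, 1, 2, 2, 2, 3, 3, 4, 6)

# Helper (hypothetical): the first date with the highest number of cases
peak_date <- function(dates) {
  counts <- table(dates)
  as.Date(names(counts)[which.max(counts)])
}

# Peak in the observed data
observed_peak <- peak_date(toy_onset_dates)

# Resample the dates with replacement and recompute the peak each time
bootstrap_peaks <- replicate(
  200,
  as.numeric(peak_date(sample(toy_onset_dates, replace = TRUE)))
)

# Summarise the distribution of bootstrap peaks with a 95% interval
peak_interval <- as.Date(
  quantile(bootstrap_peaks, probs = c(0.025, 0.5, 0.975), type = 1),
  origin = "1970-01-01"
)

observed_peak
peak_interval
```

The spread of the interval reflects how sensitive the peak date is to which cases happen to be in the sample.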

Challenge 4: Can you do it?

Task: Estimate the peak time from biweekly_incidence object.

Visualization with ggplot2

incidence2 produces basic plots for epicurves, but additional work is required to create well-annotated graphs. Using the ggplot2 package, you can generate more sophisticated epicurves with more flexibility in annotation. ggplot2 is a comprehensive package with many functionalities, but we will focus on three key elements for producing epicurves: histogram plots, scaling date axes and their labels, and general plot theme annotation. The example below demonstrates how to configure these three elements for a simple incidence2 object.

R

# Define date breaks for the x-axis
breaks <- seq.Date(
  from = min(as.Date(daily_incidence$date_index), na.rm = TRUE),
  to = max(as.Date(daily_incidence$date_index), na.rm = TRUE),
  by = 20 # every 20 days
)

# Create the plot
ggplot2::ggplot(data = daily_incidence) +
  geom_histogram(
    mapping = aes(
      x = as.Date(date_index),
      y = count
    ),
    stat = "identity",
    color = "blue", # bar border color
    fill = "lightblue", # bar fill color
    width = 1 # bar width
  ) +
  theme_minimal() + # apply a minimal theme for clean visuals
  theme(
    plot.title = element_text(face = "bold",
                              hjust = 0.5), # center and bold title
    plot.subtitle = element_text(hjust = 0.5), # center subtitle
    plot.caption = element_text(face = "italic",
                                hjust = 0), # italicized caption
    axis.title = element_text(face = "bold"), # bold axis titles
    axis.text.x = element_text(angle = 45, vjust = 0.5) # rotated x-axis text
  ) +
  labs(
    x = "Date", # x-axis label
    y = "Number of cases", # y-axis label
    title = "Daily Outbreak Cases", # plot title
    subtitle = "Epidemiological Data for the Outbreak", # plot subtitle
    caption = "Data Source: Simulated Data" # plot caption
  ) +
  scale_x_date(
    breaks = breaks, # set custom breaks on the x-axis
    labels = scales::label_date_short() # shortened date labels
  )

WARNING

Warning in geom_histogram(mapping = aes(x = as.Date(date_index), y = count), :
Ignoring unknown parameters: `binwidth`, `bins`, and `pad`
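The warning above appears because geom_histogram() is designed to bin raw observations, whereas here the counts are already computed and passed with stat = "identity", so the binning parameters are ignored. One possible alternative, not used in the lesson itself, is ggplot2::geom_col(), which draws bars from pre-computed heights. A minimal sketch on a hypothetical data frame (toy_counts) might look like this:

```r
library(ggplot2)

# Hypothetical pre-aggregated daily counts
toy_counts <- data.frame(
  date_index = as.Date("2023-01-01") + 0:4,
  count = c(1L, 3L, 2L, 5L, 4L)
)

# geom_col() uses the y values directly as bar heights, so no binning occurs
toy_epicurve <- ggplot(data = toy_counts) +
  geom_col(
    mapping = aes(x = date_index, y = count),
    color = "blue", # bar border color
    fill = "lightblue", # bar fill color
    width = 1 # bar width
  ) +
  scale_x_date(labels = scales::label_date_short()) +
  theme_minimal()

toy_epicurve
```

The resulting bars should look the same as with geom_histogram(stat = "identity"), without the warning about ignored parameters.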

To visualize an epicurve with different groups, map the grouping variable to the group aesthetic (and, optionally, fill) inside aes(). To show each group in its own panel, add facet_wrap(), as demonstrated in the example below:

R

# Plot daily incidence by sex with facets
ggplot2::ggplot(data = daily_incidence_2) +
  geom_histogram(
    mapping = aes(
      x = as.Date(date_index),
      y = count,
      group = sex,
      fill = sex
    ),
    stat = "identity"
  ) +
  theme_minimal() + # apply minimal theme
  theme(
    plot.title = element_text(face = "bold",
                              hjust = 0.5), # bold and center the title
    plot.subtitle = element_text(hjust = 0.5), # center the subtitle
    plot.caption = element_text(face = "italic", hjust = 0), # italic caption
    axis.title = element_text(face = "bold"), # bold axis labels
    axis.text.x = element_text(angle = 45,
                               vjust = 0.5) # rotate x-axis text for readability
  ) +
  labs(
    x = "Date", # x-axis label
    y = "Number of cases", # y-axis label
    title = "Daily Outbreak Cases by Sex", # plot title
    subtitle = "Incidence of Cases Grouped by Sex", # plot subtitle
    caption = "Data Source: Simulated Data" # caption for additional context
  ) +
  facet_wrap(~sex) + # create separate panels by sex
  scale_x_date(
    breaks = breaks, # set custom date breaks
    labels = scales::label_date_short() # short date format for x-axis labels
  ) +
  scale_fill_manual(values = c("lightblue",
                               "lightpink")) # custom fill colors for sex

WARNING

Warning in geom_histogram(mapping = aes(x = as.Date(date_index), y = count, :
Ignoring unknown parameters: `binwidth`, `bins`, and `pad`

Challenge 5: Can you do it?

Task: Produce an annotated figure for biweekly_incidence using the ggplot2 package.

Key Points

Use the simulist package to generate synthetic outbreak data.
Use the incidence2 package to aggregate case data based on a date event, and produce epidemic curves.
Use the ggplot2 package to produce better annotated epicurves.