Validate case data
Last updated on 2026-06-18
Overview
Questions
- How can raw case data be converted into a linelist object?
Objectives
- Demonstrate how to convert case data into linelist data
- Demonstrate how to tag and validate data to make analysis more reliable
This episode requires you to:
- Download the cleaned_data.csv file
- Save it in the data/ folder
Introduction
In outbreak analysis, once you have completed the initial steps of
reading and cleaning the case data, it’s essential to establish an
additional fundamental layer to ensure the integrity and reliability of
subsequent analyses. Without this step, you may encounter issues later;
for example, variables may be unintentionally modified or removed, or
their data types (like <Date> or <character>) may change during
processing. This additional layer typically involves two key steps:
- tagging: Assigning a standard role (such as case ID or date of onset) to each column the analysis requires, which confirms that those columns are present in the dataset.
- validation: Checking that the tagged columns have the correct data types, backed by safeguards so that they are not accidentally deleted or altered during subsequent data manipulation steps.
This episode focuses on creating a linelist object using the linelist package,
which natively supports tagging and validating outbreak data to ensure
data integrity throughout the analysis workflow. Let’s start by loading
the package rio to read data and the
linelist package to create a linelist object. We’ll use
the pipe operator (%>%) to connect some of their
functions, including others from the package dplyr. For
this reason, we will also load the {tidyverse} package.
R
# Load packages
library(tidyverse) # for {dplyr} functions and the pipe %>% operator
library(rio) # for importing data
library(here) # for easy file referencing
library(linelist) # for tagging and validating
The double-colon (::) operator
The :: in R lets you access functions or objects from a
specific package without attaching the entire package to the search
path. It offers several important advantages, including the
following:
- Telling explicitly which package a function comes from, reducing ambiguity and potential conflicts when several packages have functions with the same name
- Allowing you to call a function from a package without loading the whole package with library()
For example, the command dplyr::filter(data, condition)
means we are calling the filter() function from the
dplyr package.
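As a small illustration that needs no extra installation, the sketch below calls a function from the tools package, which ships with R but is not attached by default. The file name is only an example value.

```r
# The tools package ships with R but is not attached by default,
# so file_ext() is reached here through the :: operator.
file_name <- "cleaned_data.csv"

# Explicitly take file_ext() from the tools package
extension <- tools::file_ext(file_name)

print(extension)
```

Calling file_ext(file_name) without the tools:: prefix would fail in a fresh session, because no library(tools) call has attached the package.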
Import the dataset following the guidelines outlined in the Read case data episode. This involves loading the dataset into the working environment and viewing its structure and content.
R
# Read data
# e.g., if path to file is data/cleaned_data.csv then:
cleaned_data <- rio::import(
  here::here("data", "cleaned_data.csv")
) %>%
  dplyr::as_tibble() # for a simple data frame output
OUTPUT
# A tibble: 15,000 × 8
v1 case_id age gender status date_onset date_sample reporting_delay
<int> <int> <dbl> <chr> <chr> <IDate> <IDate> <int>
1 1 14905 90 male confirmed 2015-03-15 2015-04-06 22
2 2 13043 25 female <NA> 2013-09-11 2014-01-03 114
3 3 14364 54 female <NA> 2014-02-09 2015-03-03 387
4 4 14675 90 <NA> <NA> 2014-10-19 2014-12-31 73
5 5 12648 74 female <NA> 2014-06-08 2016-10-10 855
6 6 14274 76 female <NA> 2015-04-05 2016-01-23 293
7 7 14132 16 male confirmed NA 2015-10-05 NA
8 8 14715 44 female confirmed NA 2016-04-24 NA
9 9 13435 26 male <NA> 2014-07-09 2014-09-20 73
10 10 14816 30 female <NA> 2015-06-29 2015-02-06 -143
# ℹ 14,990 more rows
Example scenario: an unexpected change
You are in an emergency response situation. You need to generate daily situation reports. You automated your analysis to read data directly from the online server. However, the people in charge of the data collection/administration needed to remove/rename/reformat one variable you found helpful!
How can you detect if the input data is still valid to replicate the analysis code you wrote the day before?
Creating a linelist and tagging columns
Before diving in, it helps to distinguish the two steps:
tagging attaches a semantic role (such as case
ID or date of onset) to a column in your dataset, while
validation checks that the tagged columns still exist
and have the expected data types. Tagging is done once when you build
the linelist object; validation is something you can run
repeatedly as the underlying data evolves.
Once the data is loaded and cleaned, we can convert the cleaned case
data into a linelist object using the
linelist package, as in the code chunk below.
R
# Create a linelist object from cleaned data
linelist_data <- linelist::make_linelist(
  x = cleaned_data,          # Input data
  id = "case_id",            # Column for unique case identifiers
  date_onset = "date_onset", # Column for date of symptom onset
  gender = "gender"          # Column for gender
)
# Display the resulting linelist object
linelist_data
OUTPUT
// linelist object
# A tibble: 15,000 × 8
v1 case_id age gender status date_onset date_sample reporting_delay
<int> <int> <dbl> <chr> <chr> <IDate> <IDate> <int>
1 1 14905 90 male confirmed 2015-03-15 2015-04-06 22
2 2 13043 25 female <NA> 2013-09-11 2014-01-03 114
3 3 14364 54 female <NA> 2014-02-09 2015-03-03 387
4 4 14675 90 <NA> <NA> 2014-10-19 2014-12-31 73
5 5 12648 74 female <NA> 2014-06-08 2016-10-10 855
6 6 14274 76 female <NA> 2015-04-05 2016-01-23 293
7 7 14132 16 male confirmed NA 2015-10-05 NA
8 8 14715 44 female confirmed NA 2016-04-24 NA
9 9 13435 26 male <NA> 2014-07-09 2014-09-20 73
10 10 14816 30 female <NA> 2015-06-29 2015-02-06 -143
# ℹ 14,990 more rows
// tags: id:case_id, date_onset:date_onset, gender:gender
The linelist package supplies tags for common
epidemiological variables and a set of appropriate data types for each.
You can view the list of available tag names and their acceptable data
types using the linelist::tags_types() function.
Challenge
Let’s now tag additional variables. In some datasets, variable names may not exactly match the predefined tag names. In these cases, you can map them based on how the variables were defined during data collection. You need to:
- Explore the available tag names in linelist.
- Find what other variables in the input dataset can be associated with any of these available tags.
- Tag those variables as shown above using the linelist::make_linelist() function.
You can access the list of available tag names in linelist using:
R
# Get a list of available tags names and data types
linelist::tags_types()
# Get a list of names only
linelist::tags_names()
R
linelist::make_linelist(
  x = cleaned_data,
  id = "case_id",
  date_onset = "date_onset",
  gender = "gender",
  age = "age",                   # same name in default list and dataset
  date_reporting = "date_sample" # different names but related
)
Are the additional tags visible in the output?
Do you want to see a display of available and tagged variables? You
can explore the function linelist::tags() and read its reference
documentation.
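A minimal sketch of what linelist::tags() returns, assuming a small made-up data frame in place of cleaned_data (the values below are invented for illustration):

```r
library(linelist)

# Toy data standing in for the episode's cleaned_data (values are made up)
toy_data <- data.frame(
  case_id = 1:3,
  age = c(90, 25, 54),
  gender = c("male", "female", "female"),
  date_onset = as.Date(c("2015-03-15", "2013-09-11", "2014-02-09"))
)

toy_linelist <- make_linelist(
  x = toy_data,
  id = "case_id",
  date_onset = "date_onset",
  gender = "gender"
)

# Named list mapping each tag that has been set to its column name
current_tags <- tags(toy_linelist)
print(current_tags)

# show_null = TRUE also lists the available tags that are still unset
all_tags <- tags(toy_linelist, show_null = TRUE)
print(names(all_tags))
```

Comparing the two outputs shows which of the available tags, such as age here, could still be mapped to a column.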
Validation
Recall the scenario above, where an upstream change to the data (a
removed, renamed, or reformatted variable) could quietly break your
analysis. Validation is the check that catches this: running
linelist::validate_linelist() confirms that every tagged
column is still present and still has the expected data type. In an
ongoing analysis, you can re-run it each time fresh data arrives, so
that any breaking change is flagged immediately rather than propagating
downstream.
To ensure that all tagged variables are standardized and have the
correct data types, use the linelist::validate_linelist()
function, as shown in the example below:
R
linelist::validate_linelist(linelist_data)
OUTPUT
'linelist_data' is a valid linelist object
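For an automated report, one option is to wrap tagging and validation so that a failure is reported without stopping the whole script. The sketch below is only one way to do this: is_valid_case_data() is a hypothetical helper, not part of linelist, and the two data frames are made-up stand-ins for successive data deliveries.

```r
library(linelist)

# Hypothetical helper (not part of linelist): tag the incoming data,
# validate it, and return TRUE or FALSE instead of stopping the script.
is_valid_case_data <- function(case_data) {
  tryCatch(
    {
      tagged <- make_linelist(case_data, id = "case_id", age = "age")
      validate_linelist(tagged)
      TRUE
    },
    error = function(e) {
      message("Validation failed: ", conditionMessage(e))
      FALSE
    }
  )
}

# Yesterday's data: age is numeric, as the age tag expects
yesterday <- data.frame(case_id = 1:3, age = c(90, 25, 54))

# Today's data: age now arrives as text
today <- data.frame(case_id = 4:6, age = c("74", "76", "16"))

is_valid_case_data(yesterday)
is_valid_case_data(today)
```

The second call returns FALSE and prints the reason, which a daily report could log before deciding whether to continue.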
If your dataset requires a new tag other than those defined in the
package linelist, use allow_extra = TRUE
when creating the linelist object with its corresponding
data type using the function linelist::make_linelist().
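A hedged sketch of how an extra tag might be set and validated. The tag name case_status and the toy data are hypothetical, and the sketch assumes that the accepted class for the new tag is declared through the ref_types argument with linelist::tags_types(); check the package reference for your installed version.

```r
library(linelist)

# Toy data (values are made up)
toy_data <- data.frame(
  case_id = 1:3,
  status = c("confirmed", "suspected", "confirmed")
)

# "case_status" is a hypothetical tag name, not one of the package defaults,
# so allow_extra = TRUE is needed to set it
toy_linelist <- make_linelist(
  x = toy_data,
  id = "case_id",
  case_status = "status",
  allow_extra = TRUE
)

# Validation needs the accepted class for the extra tag; here it is
# assumed to be declared by extending the defaults in tags_types()
validate_linelist(
  toy_linelist,
  allow_extra = TRUE,
  ref_types = tags_types(case_status = "character", allow_extra = TRUE)
)
```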
Changes in Variable Types During Linelist Validation
Let’s assume the following scenario during an ongoing outbreak. You notice at some point that the data stream you have been relying on has a set of new entries (i.e., rows or observations), and the data type of one variable has changed.
Let’s consider the example where the type of the age
variable has changed from a double (<numeric>) to
character (<character>).
To simulate this situation:
- Change the data type of the variable
- Tag the variable into a linelist
- Validate the linelist
Describe how linelist::validate_linelist() reacts when
there is a change in the data type of one variable of the input
data.
We can use dplyr::mutate() to change the variable type
before tagging for validation. For example:
R
# nolint start
cleaned_data %>%
  # simulate a change of data type in one variable
  dplyr::mutate(age = as.character(age)) %>%
  # tag one variable
  linelist::.... %>%
  # validate the linelist
  linelist::...
# nolint end
Please run the code line by line, focusing only on the parts before the pipe (%>%). After each step, observe the output before moving to the next line.
If the age variable changes from double
(<dbl>) to character (<chr>) we
get the following:
R
cleaned_data %>%
  # simulate a change of data type in one variable
  dplyr::mutate(age = as.character(age)) %>%
  # tag one variable
  linelist::make_linelist(age = "age") %>%
  # validate the linelist
  linelist::validate_linelist()
ERROR
Error:
! Some tags have the wrong class:
- age: Must inherit from class 'numeric'/'integer', but has class 'character'
Why are we getting an Error message?
Explore other situations to understand this behavior by converting:
- date_onset from <Date> to <character>
- gender from <character> to <integer>
Then tag them into a linelist for validation. Does the
Error message suggest a fix to the issue?
R
# Change 2
# Run this code line by line to identify changes
cleaned_data %>%
  # simulate a change of data type
  dplyr::mutate(date_onset = as.character(date_onset)) %>%
  # tag
  linelist::make_linelist(date_onset = "date_onset") %>%
  # validate
  linelist::validate_linelist()
R
# Change 3
# Run this code line by line to identify changes
cleaned_data %>%
  # simulate a change of data type
  dplyr::mutate(gender = as.factor(gender)) %>%
  dplyr::mutate(gender = as.integer(gender)) %>%
  # tag
  linelist::make_linelist(gender = "gender") %>%
  # validate
  linelist::validate_linelist()
We get Error messages because the default type of these
variables in linelist::tags_types() is different from the
one we have assigned.
The Error message informs us that in order to
validate our linelist, we must fix the input variable
type to fit the expected tag type. In a data analysis script, we can do
this by adding one cleaning step into the pipeline.
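As a sketch of such a cleaning step, assume a small made-up data frame in which age arrives as text. Converting it back to numeric before tagging lets validation pass:

```r
library(linelist)
library(dplyr)

# Toy data in which age arrives as text (values are made up)
incoming_data <- data.frame(
  case_id = 1:3,
  age = c("90", "25", "54")
)

validated <- incoming_data %>%
  # cleaning step: restore the type expected by the age tag
  dplyr::mutate(age = as.numeric(age)) %>%
  # tag
  linelist::make_linelist(id = "case_id", age = "age") %>%
  # validate
  linelist::validate_linelist()
```

Note that as.numeric() turns any entry that is not a number into NA with a warning, so it is worth inspecting the converted column before relying on it.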
Until now, a typical workflow can look like this:
R
# use cleaned data
cleaned_data %>%
  # tag as many variables as possible
  # creates the <linelist> class object
  linelist::make_linelist(
    id = "case_id",
    date_onset = "date_onset",
    gender = "gender"
  ) %>%
  # validate the linelist
  linelist::validate_linelist()
OUTPUT
'.' is a valid linelist object
Safeguarding
Safeguarding is built into linelist objects. If you try to drop any of
the tagged columns, you will receive an error or warning message, as
shown in the example below.
R
new_df <- linelist_data %>%
  dplyr::select(case_id, gender)
WARNING
Warning: The following tags have lost their variable:
date_onset:date_onset
The Warning message above is the default output option
when we lose tags in a linelist object. However, it can be
changed to an Error message using the
linelist::lost_tags_action() function.
Deciding between Warning or Error message
will depend on the level of attention or flexibility you need when
losing tags. A Warning will alert you about a change but
will continue running the code downstream. An Error will
stop your analysis pipeline and the rest will not be executed.
A data reading, cleaning and validation script may require a more stable or fixed pipeline. An exploratory data analysis may require a more flexible approach. These two processes can be isolated in different scripts or repositories to adjust the safeguarding according to your needs.
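The sketch below shows one way to switch to the stricter behaviour, using a small made-up linelist. The error is caught with tryCatch() only so that the example can report it and carry on; in a real pipeline you would usually let it stop the script.

```r
library(linelist)

# Toy linelist with three tagged columns (values are made up)
toy_linelist <- make_linelist(
  x = data.frame(
    case_id = 1:3,
    gender = c("male", "female", "female"),
    date_onset = as.Date(c("2015-03-15", "2013-09-11", "2014-02-09"))
  ),
  id = "case_id",
  gender = "gender",
  date_onset = "date_onset"
)

# Make lost tags stop the code instead of only warning
lost_tags_action(action = "error")

# Dropping the tagged date_onset column now raises an error
outcome <- tryCatch(
  {
    toy_linelist[, c("case_id", "gender")]
    "columns dropped"
  },
  error = function(e) "stopped: a tagged column was lost"
)
print(outcome)

# Restore the default behaviour
lost_tags_action(action = "warning")
```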
Before you continue, set the configuration back to the default option
of Warning:
R
# set behavior to the default option: "warning"
linelist::lost_tags_action()
OUTPUT
Lost tags will now issue a warning.
A linelist object resembles a data frame but offers
richer features and functionalities. Packages that are
linelist-aware can leverage these features. For example,
you can extract a data frame of only the tagged columns using the
linelist::tags_df() function, as shown below:
R
linelist::tags_df(linelist_data)
OUTPUT
# A tibble: 15,000 × 3
id date_onset gender
<int> <IDate> <chr>
1 14905 2015-03-15 male
2 13043 2013-09-11 female
3 14364 2014-02-09 female
4 14675 2014-10-19 <NA>
5 12648 2014-06-08 female
6 14274 2015-04-05 female
7 14132 NA male
8 14715 NA female
9 13435 2014-07-09 male
10 14816 2015-06-29 female
# ℹ 14,990 more rows
This allows for the use of tagged variables only in downstream analysis, which will be useful for the next episode (Aggregate and visualize)!
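One consequence worth illustrating is that linelist::tags_df() names its columns after the tags rather than the original variables. In the sketch below, two made-up sources with different column names end up with the same layout once tagged; the data frames and their column names are hypothetical.

```r
library(linelist)

# Two toy sources that name the same variables differently (made-up values)
source_a <- data.frame(case_id = 1:2, sex = c("male", "female"))
source_b <- data.frame(patient = 3:4, gender = c("female", "male"))

tagged_a <- make_linelist(source_a, id = "case_id", gender = "sex")
tagged_b <- make_linelist(source_b, id = "patient", gender = "gender")

# tags_df() names columns after the tags, so both outputs share a layout
standard_a <- tags_df(tagged_a)
standard_b <- tags_df(tagged_b)

# The standardized data frames can then be stacked
combined <- rbind(standard_a, standard_b)
print(combined)
```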
You can connect all the steps learned in this episode in a single pipe:
R
# use cleaned data
cleaned_data %>%
  # tag as many variables as possible
  # creates the <linelist> class object
  linelist::make_linelist(
    id = "case_id",
    date_onset = "date_onset",
    gender = "gender"
  ) %>%
  # validate the linelist
  linelist::validate_linelist() %>%
  # extract a df with standard column names
  linelist::tags_df()
OUTPUT
'.' is a valid linelist object
OUTPUT
# A tibble: 15,000 × 3
id date_onset gender
<int> <IDate> <chr>
1 14905 2015-03-15 male
2 13043 2013-09-11 female
3 14364 2014-02-09 female
4 14675 2014-10-19 <NA>
5 12648 2014-06-08 female
6 14274 2015-04-05 female
7 14132 NA male
8 14715 NA female
9 13435 2014-07-09 male
10 14816 2015-06-29 female
# ℹ 14,990 more rows
When should I use {linelist}?
Data analysis during an outbreak response or mass-gathering surveillance demands a different set of data safeguards compared with usual research situations. For example, your data will change or be updated over time (e.g., new entries, new variables, renamed variables).
linelist is more appropriate for this type of ongoing
or long-lasting analysis. Check the “Get started” vignette section about
When I should consider using {linelist}? for more information.
- Use the linelist package to tag, validate, and prepare case data for downstream analysis.
- Explore and map dataset variables to predefined tags for standardization.
- Understand how Warnings vs. Errors affect the data processing workflow.