Validate case data
Last updated on 2025-03-24
- How to convert a raw dataset into a linelist object?
- Demonstrate how to convert case data to linelist data
- Demonstrate how to tag and validate data to make analysis more reliable
This episode requires you to:
- Download the cleaned_data.csv
- Save it in the
In outbreak analysis, once you have completed the initial steps of
reading and cleaning the case data, it’s essential to establish an
additional foundation layer to ensure the integrity and reliability of
subsequent analyses. Otherwise, you might find that your analysis
suddenly stops working when specific variables appear or disappear, or
when their underlying data types (like <date>) change. Specifically, this
additional layer
involves: 1) verifying the presence and correct data type of certain
columns within your dataset, a process commonly referred to as
“tagging”; 2) implementing measures to check that these tagged columns
are not inadvertently deleted during further data processing steps,
known as “validation”.
This episode focuses on tagging and validating outbreak data using the linelist package.
Let’s start by loading the package rio to read data and
the package linelist to create a linelist object. We’ll
use the pipe %>% to connect some of their functions,
including others from the package dplyr, so let’s also
load the tidyverse package:
# Load packages
library(tidyverse) # for {dplyr} functions and the pipe %>%
library(rio) # for importing data
library(here) # for easy file referencing
library(linelist) # for tagging and validating
The double-colon
The double-colon :: in R lets you call a specific
function from a package without loading the entire package into the
current environment.
For example, dplyr::filter(data, condition) calls filter()
from the dplyr package.
This helps us remember package functions and avoid namespace conflicts.
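As a minimal sketch of the namespace conflict mentioned above: both {dplyr} and {stats} provide a function called filter(), and they do very different things. The `cases` data frame and the object names below are made up for illustration and are not part of the lesson dataset.

```r
# Toy data frame, made up for illustration (not the lesson dataset)
cases <- data.frame(case_id = 1:4, age = c(12, 35, 61, 48))

# With the prefix, R uses filter() from {dplyr},
# even if {dplyr} has not been attached with library()
adults <- dplyr::filter(cases, age >= 18)
adults

# {stats} also has a filter(), a time-series function with a different
# purpose; the prefix makes it explicit which one is meant
smoothed <- stats::filter(cases$age, rep(1 / 2, 2))
smoothed
```

Writing the prefix costs a few extra characters, but the code then behaves the same regardless of which packages happen to be attached in the session.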
Import the dataset following the guidelines outlined in the Read case data episode. This involves loading the dataset into the working environment and viewing its structure and content.
# Read data
# e.g.: if the path to the file is data/cleaned_data.csv then:
cleaned_data <- rio::import(
  here::here("data", "cleaned_data.csv")
) %>%
  dplyr::as_tibble() # for a simple data frame output
# A tibble: 15,000 × 10
v1 case_id age gender status date_onset date_sample row_id
<int> <int> <dbl> <chr> <chr> <IDate> <IDate> <int>
1 1 14905 90 male confirmed 2015-03-15 2015-06-04 1
2 2 13043 25 female <NA> 2013-09-11 2014-03-01 2
3 3 14364 54 female <NA> 2014-09-02 2015-03-03 3
4 4 14675 90 <NA> <NA> 2014-10-19 2031-12-14 4
5 5 12648 74 female <NA> 2014-08-06 2016-10-10 5
6 6 14274 76 female <NA> 2015-04-05 2016-01-23 7
7 7 14132 16 male confirmed NA 2015-05-10 8
8 8 14715 44 female confirmed NA 2016-04-24 9
9 9 13435 26 male <NA> 2014-09-07 2020-09-14 10
10 10 14816 30 female <NA> 2015-06-29 2015-06-02 11
# ℹ 14,990 more rows
# ℹ 2 more variables: years_since_collection <int>, remainder_months <int>
An unexpected change
You are in an emergency response situation. You need to generate daily situation reports. You automated your analysis to read data directly from the online server 😁. However, the people in charge of the data collection/administration needed to remove/rename/reformat one variable you found helpful 😞!
How can you detect if the data input is still valid to replicate the analysis code you wrote the day before?
Creating a linelist and tagging elements
Once the data is loaded and cleaned, we convert the cleaned case data
into a linelist object using the linelist package, as shown in the
code chunk below.
# Create a linelist object from cleaned data
linelist_data <- linelist::make_linelist(
  x = cleaned_data, # Input data
  id = "case_id", # Column for unique case identifiers
  date_onset = "date_onset", # Column for date of symptom onset
  gender = "gender" # Column for gender
)
# Display the resulting linelist object
linelist_data
// linelist object
# A tibble: 15,000 × 10
v1 case_id age gender status date_onset date_sample row_id
<int> <int> <dbl> <chr> <chr> <IDate> <IDate> <int>
1 1 14905 90 male confirmed 2015-03-15 2015-06-04 1
2 2 13043 25 female <NA> 2013-09-11 2014-03-01 2
3 3 14364 54 female <NA> 2014-09-02 2015-03-03 3
4 4 14675 90 <NA> <NA> 2014-10-19 2031-12-14 4
5 5 12648 74 female <NA> 2014-08-06 2016-10-10 5
6 6 14274 76 female <NA> 2015-04-05 2016-01-23 7
7 7 14132 16 male confirmed NA 2015-05-10 8
8 8 14715 44 female confirmed NA 2016-04-24 9
9 9 13435 26 male <NA> 2014-09-07 2020-09-14 10
10 10 14816 30 female <NA> 2015-06-29 2015-06-02 11
# ℹ 14,990 more rows
# ℹ 2 more variables: years_since_collection <int>, remainder_months <int>
// tags: id:case_id, date_onset:date_onset, gender:gender
The linelist package supplies tags for common
epidemiological variables and a set of appropriate data types for each.
You can view the list of available tags by variable name, together with
the acceptable data types for each, using linelist::tags_types().
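The list of accepted data types can also be used for a rough manual check of a column before tagging it. The sketch below assumes that linelist::tags_types() returns a named list of class names (one entry per tag). The toy vectors and the use of inherits() are illustrative choices, not the package's own validation routine.

```r
# Accepted classes for every available tag, as a named list
accepted <- linelist::tags_types()

# Accepted classes for two of the tags used in this episode
accepted[c("age", "gender")]

# Toy columns, made up for illustration (not the lesson dataset)
toy_age <- c(90, 25, 54)
toy_gender <- c("male", "female", "female")

# Rough manual check: does each column inherit from an accepted class?
age_ok <- inherits(toy_age, accepted$age)
gender_ok <- inherits(toy_gender, accepted$gender)

# The same ages stored as text no longer match the accepted classes
age_as_text_ok <- inherits(as.character(toy_age), accepted$age)

c(age = age_ok, gender = gender_ok, age_as_text = age_as_text_ok)
```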
Let’s tag more variables. In new datasets, variable names will often differ from the available tag names. However, we can associate them based on how the variables were defined for data collection.
- Explore the available tag names in {linelist}.
- Find what other variables in the cleaned dataset can be associated with any of these available tags.
Tag those variables as above using linelist::make_linelist().
You can access the list of available tag names in {linelist} using:
# Get a list of available tags by name and data types
linelist::tags_types()
# Get a list of names only
linelist::tags_names()
linelist::make_linelist(
  x = cleaned_data,
  id = "case_id",
  date_onset = "date_onset",
  gender = "gender",
  age = "age", # same name in default list and dataset
  date_reporting = "date_sample" # different names but related
)
How are these additional tags visible in the output?
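Besides reading the tags line printed under the object, the tags can be inspected programmatically. The sketch below uses linelist::tags() on a small toy dataset. The data frame and object names are made up for illustration, and the column names simply mirror those of the lesson dataset.

```r
# Toy case data, made up for illustration (not the lesson dataset)
toy_cases <- data.frame(
  case_id = 1:3,
  age = c(90, 25, 54),
  gender = c("male", "female", "female"),
  date_onset = as.Date(c("2015-03-15", "2013-09-11", "2014-09-02")),
  date_sample = as.Date(c("2015-06-04", "2014-03-01", "2015-03-03"))
)

# Tag five variables; date_sample is associated with the date_reporting tag
toy_linelist <- linelist::make_linelist(
  x = toy_cases,
  id = "case_id",
  date_onset = "date_onset",
  gender = "gender",
  age = "age",
  date_reporting = "date_sample"
)

# Named list with one entry per tag: tag name -> column name
current_tags <- linelist::tags(toy_linelist)
current_tags
```

Each entry pairs a tag name with the column it points to, so a differently named column such as date_sample shows up under date_reporting.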
To ensure that all tagged variables are standardized and have the
correct data types, use linelist::validate_linelist(),
as shown in the example below:
linelist::validate_linelist(linelist_data)
Let’s validate some tagged variables. Let’s simulate a situation in an ongoing outbreak. You wake up one day to discover that the data stream you rely on has a new set of entries (i.e., rows or observations) and one variable whose data type has changed.
For example, let’s assume the variable age changed from
a double (<dbl>) variable to character (<chr>).
To simulate this situation:
- Change the variable data type,
- Tag the variable into a linelist, and then
- Validate it.
Describe how linelist::validate_linelist() reacts when the
input data has a different variable data type.
We can use dplyr::mutate() to change the variable type
before tagging for validation. For example:
cleaned_data %>%
  # simulate a change of data type in one variable
  dplyr::mutate(age = as.character(age)) %>%
  # tag one variable
  linelist::... %>%
  # validate the linelist
  linelist::...
Please run the code line by line, focusing only on the parts before the pipe (%>%). After each step, observe the output before moving to the next line.
If the age variable changes from double (<dbl>) to character (<chr>), we
get the following:
cleaned_data %>%
  # simulate a change of data type in one variable
  dplyr::mutate(age = as.character(age)) %>%
  # tag one variable
  linelist::make_linelist(
    age = "age"
  ) %>%
  # validate the linelist
  linelist::validate_linelist()
Error: Some tags have the wrong class:
- age: Must inherit from class 'numeric'/'integer', but has class 'character'
Why are we getting an Error message?
Explore other situations to understand this behavior. Let’s try these additional changes to variables:
- date_onset changes from a <date> variable to character (<chr>),
- gender changes from a character (<chr>) variable to integer (<int>).
Then tag them into a linelist for validation. Does the Error
message propose a solution?
# Change 2
# Run this code line by line to identify changes
cleaned_data %>%
  # simulate a change of data type
  dplyr::mutate(date_onset = as.character(date_onset)) %>%
  # tag
  linelist::make_linelist(
    date_onset = "date_onset"
  ) %>%
  # validate
  linelist::validate_linelist()
# Change 3
# Run this code line by line to identify changes
cleaned_data %>%
  # simulate a change of data type
  dplyr::mutate(gender = as.factor(gender)) %>%
  dplyr::mutate(gender = as.integer(gender)) %>%
  # tag
  linelist::make_linelist(
    gender = "gender"
  ) %>%
  # validate
  linelist::validate_linelist()
We get Error messages because of the mismatch between
the predefined tag type (from linelist::tags_types()) and
the class of the tagged variable in the linelist.
The Error message informs us that, in order to
validate our linelist, we must fix the input variable
type to fit the expected tag type. In a data analysis script, we can do
this by adding one cleaning step to the pipeline.
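A minimal sketch of such a cleaning step follows, assuming the age values arrive as text that can be converted back to numbers. The toy data frame is made up for illustration, and as.numeric() is one possible fix. Real data may need more careful cleaning before conversion.

```r
library(dplyr) # for mutate() and the pipe %>%

# Toy data where age arrives as character, made up for illustration
toy_cases <- data.frame(
  case_id = 1:3,
  age = c("90", "25", "54"),
  stringsAsFactors = FALSE
)

validated <- toy_cases %>%
  # cleaning step: restore the expected data type before tagging
  dplyr::mutate(age = as.numeric(age)) %>%
  # tag
  linelist::make_linelist(
    id = "case_id",
    age = "age"
  ) %>%
  # validate
  linelist::validate_linelist()

class(validated$age)
```

Placing the conversion before the tagging step means that validation checks the corrected column, so the pipeline only fails when the fix itself is not enough.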
Which step in the linelist workflow of tagging and validating would respond to the absence of a variable?
To explore the loss of a variable, you can simulate this scenario:
cleaned_data %>%
  # simulate the loss of one variable
  dplyr::select(-age) %>%
  # tag one variable
  linelist::make_linelist(
    age = "age"
  )
Error in base::tryCatch(base::withCallingHandlers({: 1 assertions failed:
* Variable 'tag': Must be element of set
* {'v1','case_id','gender','status','date_onset','date_sample','row_id','years_since_collection','remainder_months'},
* but is 'age'.
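The error above appears at the tagging step. If you prefer a more readable message before tagging, a small pre-check in base R can list the missing columns. This is a sketch under assumptions: `required_columns` and the toy data frame are hypothetical names introduced here for illustration, and the check is not part of the linelist package.

```r
# Columns the analysis depends on (hypothetical list for illustration)
required_columns <- c("case_id", "date_onset", "gender", "age")

# Toy input that has lost the age column, made up for illustration
toy_cases <- data.frame(
  case_id = 1:3,
  date_onset = as.Date(c("2015-03-15", "2013-09-11", "2014-09-02")),
  gender = c("male", "female", "female")
)

# Which required columns are absent from the input?
missing_columns <- setdiff(required_columns, names(toy_cases))

if (length(missing_columns) > 0) {
  message("Missing columns: ", paste(missing_columns, collapse = ", "))
}
```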
Safeguarding is implicitly built into linelist objects. If you try to drop any of the tagged columns, you will receive a warning or an error message, as shown in the example below.
new_df <- linelist_data %>%
  dplyr::select(case_id, gender)
Warning: The following tags have lost their variable:
The Warning message above is the default output
when we lose tags in a linelist object. However, it can be
changed to an Error message using linelist::lost_tags_action().
Let’s test the implications of changing the
safeguarding configuration from a Warning to an Error message.
- First, run this code to count the frequency per category within a categorical variable:
linelist_data %>%
  dplyr::select(case_id, gender) %>%
  dplyr::count(gender)
- Set the behavior for lost tags in a linelist to “error” as follows:
# set behavior to "error"
linelist::lost_tags_action(action = "error")
- Now, re-run the code segment above.
- What is the difference in the output between a Warning and an Error?
- What could be the implications of this change for your daily data analysis pipeline during an outbreak response?
Deciding between a Warning or an Error message
will depend on the level of attention or flexibility you need when
losing tags. A Warning will alert you about a change but will continue
running the code downstream. An Error will stop your analysis pipeline,
and the rest will not be executed.
A data reading, cleaning and validation script may require a more stable or fixed pipeline. An exploratory data analysis may require a more flexible approach. These two processes can be isolated in different scripts or repositories to adjust the safeguarding according to your needs.
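In a fixed pipeline, the strict setting can be combined with tryCatch() so that the script reports the problem instead of silently continuing. The sketch below is only an illustration. The toy linelist and the `outcome` labels are made up, and the default setting is restored at the end so that later code is not affected.

```r
# Toy linelist with three tagged columns, made up for illustration
toy_linelist <- linelist::make_linelist(
  x = data.frame(
    case_id = 1:3,
    gender = c("male", "female", "female"),
    date_onset = as.Date(c("2015-03-15", "2013-09-11", "2014-09-02"))
  ),
  id = "case_id",
  date_onset = "date_onset",
  gender = "gender"
)

# Strict mode: losing a tagged column stops the code
linelist::lost_tags_action(action = "error")

outcome <- tryCatch(
  {
    # date_onset is tagged, so dropping it triggers the safeguard
    dplyr::select(toy_linelist, case_id, gender)
    "pipeline continued"
  },
  error = function(e) "pipeline stopped"
)
outcome

# Restore the default so that later code is not affected
linelist::lost_tags_action(action = "warning")
```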
Before you continue, set the configuration back to the default
option of Warning:
# set behavior to the default option: "warning"
linelist::lost_tags_action(action = "warning")
Lost tags will now issue a warning.
A linelist object resembles a data frame but offers
richer features and functionalities. Packages that are linelist-aware
can leverage these features. For example, you can extract a data frame
of only the tagged columns using the linelist::tags_df()
function, as shown below:
linelist::tags_df(linelist_data)
# A tibble: 15,000 × 3
id date_onset gender
<int> <IDate> <chr>
1 14905 2015-03-15 male
2 13043 2013-09-11 female
3 14364 2014-09-02 female
4 14675 2014-10-19 <NA>
5 12648 2014-08-06 female
6 14274 2015-04-05 female
7 14132 NA male
8 14715 NA female
9 13435 2014-09-07 male
10 14816 2015-06-29 female
# ℹ 14,990 more rows
This allows the use of tagged-only columns in downstream analysis, which will be useful for the next episode!
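One practical consequence is that downstream code can refer to tag names rather than to the original column names. The sketch below uses a toy dataset made up for illustration. It shows that the columns returned by linelist::tags_df() are named after the tags (for example, id instead of case_id) and that untagged columns are left out.

```r
# Toy linelist with one untagged column (age), made up for illustration
toy_linelist <- linelist::make_linelist(
  x = data.frame(
    case_id = 1:3,
    age = c(90, 25, 54),
    gender = c("male", "female", "female"),
    date_onset = as.Date(c("2015-03-15", "2013-09-11", "2014-09-02"))
  ),
  id = "case_id",
  date_onset = "date_onset",
  gender = "gender"
)

# Keep only the tagged columns, named after their tags
tagged_only <- linelist::tags_df(toy_linelist)
names(tagged_only)

# Downstream code can now rely on tag names, e.g. cases per gender
table(tagged_only$gender)
```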
When should I use {linelist}?
Data analysis during an outbreak response or mass-gathering surveillance demands a different set of “data safeguards” compared to usual research situations. For example, your data will change or be updated over time (e.g. new entries, new variables, renamed variables).
linelist is more appropriate for this type of ongoing or long-lasting analysis. Check the “Get started” vignette section about When you should consider using {linelist}? for more information.
Key Points
- Use the linelist package to tag, validate, and prepare case data for downstream analysis.