Validate case data
Last updated on 2024-11-14 | Edit this page
Overview
Questions
- How to convert raw dataset into a
linelist
object?
Objectives
- Demonstrate how to covert case data to
linelist
data - Demonstrate how to validate data
Prerequisite
This episode requires you to:
- Download the cleaned_data.csv
- Save it in the
data/
folder.
Introduction
In outbreak analysis, once you have completed the initial steps of reading and cleaning the case data, it’s essential to establish an additional foundation layer to ensure the integrity and reliability of subsequent analyses. Specifically, this involves verifying the presence and correct data type of certain columns within your dataset, a process commonly referred to as “tagging.” Additionally. it’s crucial to implement measures to validate that these tagged columns are not inadvertently deleted during further data processing steps.
This episode focuses tagging and validate outbreak data using the linelist package.
Let’s start by loading the package rio to read data and
the package linelist to create a linelist. We’ll use the
pipe %>%
to connect some of their functions, including
others from the package dplyr, so let’s also call to the
tidyverse package:
R
# Load packages
library(tidyverse) # for {dplyr} functions and the pipe %>%
library(rio) # for importing data
library(here) # for easy file referencing
library(linelist) # for taggin and validating
The double-colon
The double-colon ::
in R let you call a specific
function from a package without loading the entire package into the
current environment.
For example, dplyr::filter(data, condition)
uses
filter()
from the dplyr package.
This help us remember package functions and avoid namespace conflicts.
Import the dataset following the guidelines outlined in the Read case data episode. This involves loading the dataset into the working environment and view its structure and content.
R
# Read data
# e.g.: if path to file is data/simulated_ebola_2.csv then:
cleaned_data <- rio::import(
here::here("data", "cleaned_data.csv")
) %>%
dplyr::as_tibble() # for a simple data frame output
OUTPUT
# A tibble: 15,000 × 10
v1 case_id age gender status date_onset date_sample row_id
<int> <int> <dbl> <chr> <chr> <IDate> <IDate> <int>
1 1 14905 90 male confirmed 2015-03-15 2015-06-04 1
2 2 13043 25 female <NA> 2013-09-11 2014-03-01 2
3 3 14364 54 female <NA> 2014-09-02 2015-03-03 3
4 4 14675 90 <NA> <NA> 2014-10-19 2031-12-14 4
5 5 12648 74 female <NA> 2014-08-06 2016-10-10 5
6 6 14274 76 female <NA> 2015-04-05 2016-01-23 7
7 7 14132 16 male confirmed NA 2015-05-10 8
8 8 14715 44 female confirmed NA 2016-04-24 9
9 9 13435 26 male <NA> 2014-09-07 2020-09-14 10
10 10 14816 30 female <NA> 2015-06-29 2015-06-02 11
# ℹ 14,990 more rows
# ℹ 2 more variables: years_since_collection <int>, remainder_months <int>
Discussion
An unexpected change
You are in an emergency response situation. You need to generate daily situation reports. You automated your analysis to read data directly from the online server 😁. However, the people in charge of the data collection/administration needed to remove/rename/reformat one variable you found helpful 😞!
How can you detect if the data input is still valid to replicate the analysis code you wrote the day before?
Creating a linelist and tagging elements
Then we convert the cleaned case data into a linelist
object using linelist package, see the below code
chunk.
R
# Create a linelist object from cleaned data
linelist_data <- linelist::make_linelist(
x = cleaned_data, # Input data
id = "case_id", # Column for unique case identifiers
date_onset = "date_onset", # Column for date of symptom onset
gender = "gender" # Column for gender
)
# Display the resulting linelist object
linelist_data
OUTPUT
// linelist object
# A tibble: 15,000 × 10
v1 case_id age gender status date_onset date_sample row_id
<int> <int> <dbl> <chr> <chr> <IDate> <IDate> <int>
1 1 14905 90 male confirmed 2015-03-15 2015-06-04 1
2 2 13043 25 female <NA> 2013-09-11 2014-03-01 2
3 3 14364 54 female <NA> 2014-09-02 2015-03-03 3
4 4 14675 90 <NA> <NA> 2014-10-19 2031-12-14 4
5 5 12648 74 female <NA> 2014-08-06 2016-10-10 5
6 6 14274 76 female <NA> 2015-04-05 2016-01-23 7
7 7 14132 16 male confirmed NA 2015-05-10 8
8 8 14715 44 female confirmed NA 2016-04-24 9
9 9 13435 26 male <NA> 2014-09-07 2020-09-14 10
10 10 14816 30 female <NA> 2015-06-29 2015-06-02 11
# ℹ 14,990 more rows
# ℹ 2 more variables: years_since_collection <int>, remainder_months <int>
// tags: id:case_id, date_onset:date_onset, gender:gender
The linelist package supplies tags for common
epidemiological variables and a set of appropriate data types for each.
You can view the list of available tags by the variable name and their
acceptable data types for each using
linelist::tags_types()
.
Challenge
Let’s tag more variables. In new datasets, it will be frequent to have variable names different to the available tag names. However, we can associate them based on how variables were defined for data collection.
Now:
- Explore the available tag names in {linelist}.
- Find what other variables in the cleaned dataset can be associated with any of these available tags.
-
Tag those variables as above using
linelist::make_linelist()
.
Your can get access to the list of available tag names in {linelist} using:
R
# Get a list of available tags by name and data types
linelist::tags_types()
# Get a list of names only
linelist::tags_names()
R
linelist::make_linelist(
x = cleaned_data,
id = "case_id",
date_onset = "date_onset",
gender = "gender",
age = "age", # same name in default list and dataset
date_reporting = "date_sample" # different names but related
)
How these additional tags are visible in the output?
Validation
To ensure that all tagged variables are standardized and have the
correct data types, use the linelist::validate_linelist()
,
as shown in the example below:
R
linelist::validate_linelist(linelist_data)
Challenge
Let’s validate tagged variables. Let’s simulate that in an ongoing outbreak; the next day, your data has a new set of entries (i.e., rows or observations) but one variable change of data type.
For example, let’s make the variable age
change of type
from a double (<dbl>
) variable to character
(<chr>
).
To simulate it:
- Change the variable data type,
- Tag the variable into a linelist, and then
- Validate it.
Describe how linelist::validate_linelist()
reacts when
input data has a different variable data type.
We can use dplyr::mutate()
to change the variable type
before tagging for validation. For example:
R
cleaned_data %>%
# simulate a change of data type in one variable
dplyr::mutate(age = as.character(age)) %>%
# tag one variable
linelist::... %>%
# validate the linelist
linelist::...
Please run the code line by line, focusing only on the parts before the pipe (
%>%
). After each step, observe the output before moving to the next line.
If the age
variable changes from double
(<dbl>
) to character (<chr>
) we
get the following:
R
cleaned_data %>%
# simulate a change of data type in one variable
dplyr::mutate(age = as.character(age)) %>%
# tag one variable
linelist::make_linelist(
age = "age"
) %>%
# validate the linelist
linelist::validate_linelist()
ERROR
Error: Some tags have the wrong class:
- age: Must inherit from class 'numeric'/'integer', but has class 'character'
Why are we getting an Error
message?
Explore other situations to understand this behavior. Let’s try these additional changes to variables:
-
date_onset
changes from a<date>
variable to character (<chr>
), -
gender
changes from a character (<chr>
) variable to integer (<int>
).
Then tag them into a linelist for validation. Does the
Error
message propose to us the solution?
R
# Change 2
# Run this code line by line to identify changes
cleaned_data %>%
# simulate a change of data type
dplyr::mutate(date_onset = as.character(date_onset)) %>%
# tag
linelist::make_linelist(
date_onset = "date_onset"
) %>%
# validate
linelist::validate_linelist()
R
# Change 3
# Run this code line by line to identify changes
cleaned_data %>%
# simulate a change of data type
dplyr::mutate(gender = as.factor(gender)) %>%
dplyr::mutate(gender = as.integer(gender)) %>%
# tag
linelist::make_linelist(
gender = "gender"
) %>%
# validate
linelist::validate_linelist()
We get Error
messages because of the mismatch between
the predefined tag type (from linelist::tags_types()
) and
the tagged variable class in the linelist.
The Error
message inform us that in order to
validate our linelist, we must fix the input variable
type to fit the expected tag type. In a data analysis script, we can do
this by adding one cleaning step into the pipeline.
Challenge
What step along the linelist workflow of tagging and validating would response to the absence of a variable?
About losing variables, you can simulate this scenario:
R
cleaned_data %>%
# simulate a change of data type in one variable
select(-age) %>%
# tag one variable
linelist::make_linelist(
age = "age"
)
ERROR
Error in base::tryCatch(base::withCallingHandlers({: 1 assertions failed:
* Variable 'tag': Must be element of set
* {'v1','case_id','gender','status','date_onset','date_sample','row_id','years_since_collection','remainder_months'},
* but is 'age'.
Safeguarding
Safeguarding is implicitly built into the linelist objects. If you try to drop any of the tagged columns, you will receive an error or warning message, as shown in the example below.
R
new_df <- linelist_data %>%
dplyr::select(case_id, gender)
WARNING
Warning: The following tags have lost their variable:
date_onset:date_onset
This Warning
message above is the default output option
when we lose tags in a linelist
object. However, it can be
changed to an Error
message using
linelist::lost_tags_action()
.
Challenge
Let’s test the implications of changing the
safeguarding configuration from a Warning
to an Error
message.
- First, run this code to count the frequency per category within a categorical variable:
R
linelist_data %>%
dplyr::select(case_id, gender) %>%
dplyr::count(gender)
- Set behavior for lost tags in a
linelist
to “error” as follows:
R
# set behavior to "error"
linelist::lost_tags_action(action = "error")
- Now, re-run the above code segment with
dplyr::count()
.
Identify:
- What is the difference in the output between a
Warning
and anError
? - What could be the implications of this change for your daily data analysis pipeline during an outbreak response?
Deciding between Warning
or Error
message
will depend on the level of attention or flexibility you need when
losing tags. One will alert you about a change but will continue running
the code downstream. The other will stop your analysis pipeline and the
rest will not be executed.
A data reading, cleaning and validation script may require a more stable or fixed pipeline. An exploratory data analysis may require a more flexible approach. These two processes can be isolated in different scripts or repositories to adjust the safeguarding according to your needs.
Before you continue, set the configuration back again to the default
option of Warning
:
R
# set behavior to the default option: "warning"
linelist::lost_tags_action()
OUTPUT
Lost tags will now issue a warning.
A linelist
object resembles a data frame but offers
richer features and functionalities. Packages that are linelist-aware
can leverage these features. For example, you can extract a data frame
of only the tagged columns using the linelist::tags_df()
function, as shown below:
R
linelist::tags_df(linelist_data)
OUTPUT
# A tibble: 15,000 × 3
id date_onset gender
<int> <IDate> <chr>
1 14905 2015-03-15 male
2 13043 2013-09-11 female
3 14364 2014-09-02 female
4 14675 2014-10-19 <NA>
5 12648 2014-08-06 female
6 14274 2015-04-05 female
7 14132 NA male
8 14715 NA female
9 13435 2014-09-07 male
10 14816 2015-06-29 female
# ℹ 14,990 more rows
This allows, the extraction of use tagged-only columns in downstream analysis, which will be useful for the next episode!
When should I use
{linelist}
?
Data analysis during an outbreak response or mass-gathering surveillance demands a different set of “data safeguards” if compared to usual research situations. For example, your data will change or be updated over time (e.g. new entries, new variables, renamed variables).
linelist is more appropriate for this type of ongoing or long-lasting analysis. Check the “Get started” vignette section about When you should consider using {linelist}? for more information.
Key Points
- Use linelist package to tag, validate, and prepare case data for downstream analysis.