Validate case data

Last updated on 2026-02-24

Estimated time: 12 minutes

Overview

Questions

  • How can a raw dataset be converted into a linelist object?

Objectives

  • Demonstrate how to convert case data into linelist data
  • Demonstrate how to tag and validate data to make analysis more reliable

Prerequisite

This episode requires you to:

Introduction


In outbreak analysis, once you have completed the initial steps of reading and cleaning the case data, it’s essential to establish an additional fundamental layer to ensure the integrity and reliability of subsequent analyses. Otherwise you might encounter issues during the analysis process due to creation or removal of specific variables, changes in their underlying data types (like <date> or <chr>), etc. Specifically, this additional step involves:

  1. Verifying the presence and correct data type of certain columns within your dataset, a process commonly referred to as tagging;
  2. Implementing measures to make sure that these tagged columns are not inadvertently deleted during further data processing steps, known as validation.
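To make these checks concrete, here is a minimal base R sketch of doing them by hand on a toy data frame. The data frame `cases` and the `expected` list are hypothetical and exist only for illustration; the linelist package introduced in this episode automates this kind of check.

```r
# Toy case data (hypothetical values, for illustration only)
cases <- data.frame(
  case_id = c(1L, 2L, 3L),
  age = c(34, 51, 27),
  date_onset = as.Date(c("2015-03-15", "2014-02-09", "2014-10-19"))
)

# Columns we rely on and the classes we accept for each (our own assumption)
expected <- list(
  case_id = c("integer", "numeric", "character"),
  age = c("integer", "numeric"),
  date_onset = "Date"
)

# Check 1: are all expected columns present?
missing_cols <- setdiff(names(expected), names(cases))

# Check 2: does each present column inherit from an accepted class?
present <- intersect(names(expected), names(cases))
type_ok <- vapply(
  present,
  function(col) inherits(cases[[col]], expected[[col]]),
  logical(1)
)

missing_cols # character(0) when nothing is missing
type_ok      # one TRUE/FALSE per checked column
```

Writing and maintaining such checks by hand quickly becomes tedious, which is the motivation for a dedicated tool.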

This episode focuses on tagging and validating outbreak data using the linelist package. Let’s start by loading the package rio to read data and the linelist package to create a linelist object. We’ll use the pipe operator (%>%) to connect some of their functions, including others from the package dplyr. For this reason, we will also load the {tidyverse} package.

R

# Load packages
library(tidyverse) # to access {dplyr} functions and the pipe %>% operator
# from {magrittr}
library(rio) # for importing data
library(here) # for easy file referencing
library(linelist) # for tagging and validating

The double-colon (::) operator

The :: operator in R lets you access functions or objects from a specific package without attaching the entire package to the search path. It offers several important advantages, including the following:

  • Telling explicitly which package a function comes from, reducing ambiguity and potential conflicts when several packages have functions with the same name.
  • Allowing you to call a function from a package without attaching the whole package with library().

For example, the command dplyr::filter(data, condition) means we are calling the filter() function from the dplyr package.
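As a small illustration of both points, the sketch below uses only packages that ship with R. The vector `ages` and the locally defined `median()` are made up for this example.

```r
# Call a function by its full name without attaching the package.
# {stats} ships with R, so no installation is needed.
ages <- c(90, 25, 54, 90, 74)
stats::median(ages) # 74

# If a function with the same name is defined locally, the `::` form
# still reaches the package version.
median <- function(x) "my own median"
median(ages)        # calls the local definition
stats::median(ages) # calls the {stats} version, again 74

# Remove the local definition so it does not mask stats::median() later
rm(median)
```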

Import the dataset following the guidelines outlined in the Read case data episode. This involves loading the dataset into the working environment and viewing its structure and content.

R

# Read data
# e.g.: if the path to the file is data/cleaned_data.csv then:
cleaned_data <- rio::import(
  here::here("data", "cleaned_data.csv")
) %>%
  dplyr::as_tibble() # for a simple data frame output

OUTPUT

# A tibble: 15,000 × 9
      v1 case_id   age gender status    date_onset date_sample
   <int>   <int> <dbl> <chr>  <chr>     <IDate>    <IDate>
 1     1   14905    90 male   confirmed 2015-03-15 2015-04-06
 2     2   13043    25 female <NA>      2013-09-11 2014-01-03
 3     3   14364    54 female <NA>      2014-02-09 2015-03-03
 4     4   14675    90 <NA>   <NA>      2014-10-19 2014-12-31
 5     5   12648    74 female <NA>      2014-06-08 2016-10-10
 6     6   14274    76 female <NA>      2015-04-05 2016-01-23
 7     7   14132    16 male   confirmed NA         2015-10-05
 8     8   14715    44 female confirmed NA         2016-04-24
 9     9   13435    26 male   <NA>      2014-07-09 2014-09-20
10    10   14816    30 female <NA>      2015-06-29 2015-02-06
# ℹ 14,990 more rows
# ℹ 2 more variables: years_since_collection <int>, remainder_months <int>

Discussion

An unexpected change

You are in an emergency response situation. You need to generate daily situation reports. You automated your analysis to read data directly from the online server 😁. However, the people in charge of the data collection/administration needed to remove/rename/reformat one variable you found helpful 😞!

How can you detect if the input data is still valid to replicate the analysis code you wrote the day before?

If learners do not have an experience to share, we as instructors can share one.

A scenario like this usually happens when the institution doing the analysis is not the same as the institution collecting the data. The latter can make decisions about the data structure that affect downstream processes, impacting the time or the accuracy of the analysis results.

Creating a linelist and tagging columns


Once the data is loaded and cleaned, we can convert the cleaned case data into a linelist object using the linelist package, as in the code chunk below.

R

# Create a linelist object from cleaned data
linelist_data <- linelist::make_linelist(
  x = cleaned_data,         # Input data
  id = "case_id",            # Column for unique case identifiers
  date_onset = "date_onset", # Column for date of symptom onset
  gender = "gender"          # Column for gender
)

# Display the resulting linelist object
linelist_data

OUTPUT


// linelist object
# A tibble: 15,000 × 9
      v1 case_id   age gender status    date_onset date_sample
   <int>   <int> <dbl> <chr>  <chr>     <IDate>    <IDate>
 1     1   14905    90 male   confirmed 2015-03-15 2015-04-06
 2     2   13043    25 female <NA>      2013-09-11 2014-01-03
 3     3   14364    54 female <NA>      2014-02-09 2015-03-03
 4     4   14675    90 <NA>   <NA>      2014-10-19 2014-12-31
 5     5   12648    74 female <NA>      2014-06-08 2016-10-10
 6     6   14274    76 female <NA>      2015-04-05 2016-01-23
 7     7   14132    16 male   confirmed NA         2015-10-05
 8     8   14715    44 female confirmed NA         2016-04-24
 9     9   13435    26 male   <NA>      2014-07-09 2014-09-20
10    10   14816    30 female <NA>      2015-06-29 2015-02-06
# ℹ 14,990 more rows
# ℹ 2 more variables: years_since_collection <int>, remainder_months <int>

// tags: id:case_id, date_onset:date_onset, gender:gender 

The linelist package supplies tags for common epidemiological variables and a set of appropriate data types for each. You can view the available tag names together with their acceptable data types using the linelist::tags_types() function.

Challenge

Let’s tag more variables. In some datasets, it is possible to encounter variable names that are different from the available tag names. In such cases, we can associate them based on how variables were defined for data collection.

Now:

  • Explore the available tag names in linelist.
  • Find what other variables in the input dataset can be associated with any of these available tags.
  • Tag those variables as shown above using the linelist::make_linelist() function.

You can get access to the list of available tag names in linelist using:

R

# Get a list of available tags names and data types
linelist::tags_types()

# Get a list of names only
linelist::tags_names()

R

linelist::make_linelist(
  x = cleaned_data,
  id = "case_id",
  date_onset = "date_onset",
  gender = "gender",
  age = "age",
  # same name in default list and dataset
  date_reporting = "date_sample" # different names but related
)

Are these additional tags visible in the output?
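One way to check which variables are tagged is the linelist::tags() function, which returns the tags as a named list. The sketch below uses a small toy data frame (`toy_cases`, with hypothetical values) instead of the episode's dataset so that it runs on its own.

```r
library(linelist)

# Toy case data (hypothetical values, for illustration only)
toy_cases <- data.frame(
  case_id = c(101L, 102L, 103L),
  age = c(34, 51, 27),
  gender = c("female", "male", "female"),
  date_onset = as.Date(c("2015-03-15", "2014-02-09", "2014-10-19")),
  date_sample = as.Date(c("2015-04-06", "2015-03-03", "2014-12-31"))
)

toy_linelist <- linelist::make_linelist(
  x = toy_cases,
  id = "case_id",
  date_onset = "date_onset",
  gender = "gender",
  age = "age",
  date_reporting = "date_sample"
)

# Named list: tag name -> column name in the data
linelist::tags(toy_linelist)
```

Each element name is a tag and each value is the column it points to, so the pairing of date_reporting with date_sample should be visible here.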


Validation


To ensure that all tagged variables are standardized and have the correct data types, use the linelist::validate_linelist() function, as shown in the example below:

R

linelist::validate_linelist(linelist_data)

OUTPUT

'linelist_data' is a valid linelist object

If your dataset requires a tag other than those defined in the linelist package, set allow_extra = TRUE when creating the linelist object with the linelist::make_linelist() function, and associate the new tag with its corresponding data type.
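A minimal sketch of an extra tag follows, assuming we want to tag a `status` column that is not among the default tags. The toy data are hypothetical. Passing the accepted type through the `ref_types` argument of linelist::validate_linelist() reflects our reading of the package documentation; check the function reference for your installed version.

```r
library(linelist)

# Toy case data (hypothetical values, for illustration only)
toy_cases <- data.frame(
  case_id = c(101L, 102L, 103L),
  date_onset = as.Date(c("2015-03-15", "2014-02-09", "2014-10-19")),
  status = c("confirmed", NA, "confirmed")
)

# `status` is not a default tag, so allow_extra = TRUE is needed here
toy_linelist <- linelist::make_linelist(
  x = toy_cases,
  id = "case_id",
  date_onset = "date_onset",
  status = "status",
  allow_extra = TRUE
)

# Validation also needs the accepted class(es) of the extra tag
linelist::validate_linelist(
  toy_linelist,
  allow_extra = TRUE,
  ref_types = linelist::tags_types(status = "character", allow_extra = TRUE)
)
```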

Challenge

Let’s assume the following scenario during an ongoing outbreak. You notice at some point that the data stream you have been relying on has a set of new entries (i.e., rows or observations), and the data type of one variable has changed.

Let’s consider the example where the type of the age variable has changed from a double (<dbl>) to character (<chr>).

To simulate this situation:

  • Change the data type of the variable,

  • Tag the variable into a linelist, and then

  • Validate it.

Describe how linelist::validate_linelist() reacts when there is a change in the data type of one variable of the input data.

We can use dplyr::mutate() to change the variable type before tagging for validation. For example:

R

cleaned_data %>%
  # simulate a change of data type in one variable
  dplyr::mutate(age = as.character(age)) %>%
  # tag one variable
  linelist::.... %>%
  # validate the linelist
  linelist::...

Please run the code line by line, focusing only on the parts before the pipe (%>%). After each step, observe the output before moving to the next line.

R

cleaned_data %>%
  # simulate a change of data type in one variable
  dplyr::mutate(age = as.character(age)) %>%
  # tag one variable
  linelist::make_linelist(age = "age") %>%
  # validate the linelist
  linelist::validate_linelist()

ERROR

Error:
! Some tags have the wrong class:
  - age: Must inherit from class 'numeric'/'integer', but has class 'character'

Challenge (continued)

Why are we getting an Error message?

Should we have a Warning message instead? Explain why.

Explore other situations to understand this behavior by converting:

  • date_onset from <date> to character (<chr>),
  • gender from character (<chr>) to integer (<int>).

Then tag them into a linelist for validation. Does the Error message suggest a fix to the issue?

R

# Change 2
# Run this code line by line to identify changes
cleaned_data %>%
  # simulate a change of data type
  dplyr::mutate(date_onset = as.character(date_onset)) %>%
  # tag
  linelist::make_linelist(date_onset = "date_onset") %>%
  # validate
  linelist::validate_linelist()

R

# Change 3
# Run this code line by line to identify changes
cleaned_data %>%
  # simulate a change of data type
  dplyr::mutate(gender = as.factor(gender)) %>%
  dplyr::mutate(gender = as.integer(gender)) %>%
  # tag
  linelist::make_linelist(gender = "gender") %>%
  # validate
  linelist::validate_linelist()

We get Error messages because the types accepted for these variables in linelist::tags_types() differ from the types we converted them to.

The Error message informs us that, in order to validate our linelist, we must fix the input variable type to fit the expected tag type. In a data analysis script, we can do this by adding one cleaning step to the pipeline.
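Such a cleaning step could look like the sketch below, where a toy `age` column arrives as character and is converted back to numeric before tagging. The data frame `toy_cases` is hypothetical; in the episode's pipeline the same dplyr::mutate() call would sit before linelist::make_linelist().

```r
library(linelist)
library(dplyr)

# Toy input where `age` arrives as character (hypothetical values)
toy_cases <- data.frame(
  case_id = c(101L, 102L, 103L),
  age = c("34", "51", "27")
)

validated <- toy_cases %>%
  # cleaning step: restore the expected numeric type before tagging
  dplyr::mutate(age = as.numeric(age)) %>%
  # tag
  linelist::make_linelist(id = "case_id", age = "age") %>%
  # validate: passes now that `age` is numeric
  linelist::validate_linelist()
```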

Challenge

Beyond tagging and validating the linelist object, what extra step do we need when building the object?

Let’s simulate a scenario of losing a variable:

R

cleaned_data %>%
  # remove the variable 'age'
  dplyr::select(-age) %>%
  # tag variable 'age' that no longer exists
  linelist::make_linelist(
    age = "age"
  )

ERROR

Error in `base::tryCatch()`:
! 1 assertions failed:
 * Variable 'tag': Must be element of set
 * {'v1','case_id','gender','status','date_onset','date_sample','years_since_collection','remainder_months'},
 * but is 'age'.
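In an automated script, one option is to catch this error and report it clearly instead of letting the run fail midway. The following is only a sketch with a hypothetical toy data frame; how to react to the failure (stop, notify someone, fall back to older data) is a design decision for your own pipeline.

```r
library(linelist)

# Toy input that no longer contains `age` (hypothetical values)
toy_cases <- data.frame(
  case_id = c(101L, 102L),
  gender = c("female", "male")
)

# Try to tag; on failure, print the reason and return NULL
result <- tryCatch(
  linelist::make_linelist(toy_cases, id = "case_id", age = "age"),
  error = function(e) {
    message("Tagging failed: ", conditionMessage(e))
    NULL
  }
)

is.null(result) # TRUE here, because `age` is missing from the input
```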

Safeguarding


Safeguarding is implicitly built into the linelist objects. If you try to drop any of the tagged columns, you will receive an error or warning message, as shown in the example below.

R

new_df <- linelist_data %>%
  dplyr::select(case_id, gender)

WARNING

Warning: The following tags have lost their variable:
 date_onset:date_onset

The Warning message above is the default behavior when a linelist object loses tags. However, it can be changed to an Error message using the linelist::lost_tags_action() function.

Challenge

Let’s test the implications of changing the safeguarding configuration from a Warning to an Error message.

  • First, run this code to count the frequency of each category within a categorical variable:

R

linelist_data %>%
  dplyr::select(case_id, gender) %>%
  dplyr::count(gender)

  • Set the behavior for lost tags in a linelist to “error” as follows:

R

# set behavior to "error"
linelist::lost_tags_action(action = "error")

  • Now, re-run the above code chunk with dplyr::count().

Identify:

  • What is the difference in the output between a Warning and an Error?

  • What could be the implications of this change for your daily data analysis pipeline during an outbreak response?

Deciding between a Warning and an Error message depends on the level of attention or flexibility you need when losing tags. A Warning alerts you to the change but lets the downstream code keep running. An Error stops your analysis pipeline, so the remaining code is not executed.

A data reading, cleaning and validation script may require a more stable or fixed pipeline. An exploratory data analysis may require a more flexible approach. These two processes can be isolated in different scripts or repositories to adjust the safeguarding according to your needs.
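As a sketch of the stricter setup, a validation script could switch to "error" at the top and restore the default at the end. The toy linelist and the text returned by the error handler are hypothetical.

```r
library(linelist)
library(dplyr)

# Toy linelist (hypothetical values, for illustration only)
toy_linelist <- linelist::make_linelist(
  x = data.frame(
    case_id = c(101L, 102L, 103L),
    gender = c("female", "male", "female"),
    date_onset = as.Date(c("2015-03-15", "2014-02-09", "2014-10-19"))
  ),
  id = "case_id",
  gender = "gender",
  date_onset = "date_onset"
)

# Strict mode: losing a tagged column stops the code
linelist::lost_tags_action(action = "error")

outcome <- tryCatch(
  # dropping `date_onset` loses its tag, which now raises an error
  toy_linelist %>% dplyr::select(case_id, gender),
  error = function(e) "stopped: a tagged column was dropped"
)

# Restore the default so that later code is not affected
linelist::lost_tags_action(action = "warning")

outcome
```

An exploratory script would simply leave the default in place and read the warnings as they appear.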

Before you continue, set the configuration back again to the default option of Warning:

R

# set behavior to the default option: "warning"
linelist::lost_tags_action()

OUTPUT

Lost tags will now issue a warning.

A linelist object resembles a data frame but offers richer features and functionalities. Packages that are linelist-aware can leverage these features. For example, you can extract a data frame of only the tagged columns using the linelist::tags_df() function, as shown below:

R

linelist::tags_df(linelist_data)

OUTPUT

# A tibble: 15,000 × 3
      id date_onset gender
   <int> <IDate>    <chr>
 1 14905 2015-03-15 male
 2 13043 2013-09-11 female
 3 14364 2014-02-09 female
 4 14675 2014-10-19 <NA>
 5 12648 2014-06-08 female
 6 14274 2015-04-05 female
 7 14132 NA         male
 8 14715 NA         female
 9 13435 2014-07-09 male
10 14816 2015-06-29 female
# ℹ 14,990 more rows

This allows for the use of tagged variables only in downstream analysis, which will be useful for the next episode!
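For instance, because linelist::tags_df() names its columns after the tags, downstream code can refer to `gender` or `id` regardless of what the columns were called in the raw data. The sketch below shows this on a toy linelist with hypothetical values and column names.

```r
library(linelist)
library(dplyr)

# Toy data whose raw column names differ from the tag names (hypothetical)
toy_linelist <- linelist::make_linelist(
  x = data.frame(
    patient_code = c(101L, 102L, 103L),
    sex = c("female", "male", "female")
  ),
  id = "patient_code",
  gender = "sex"
)

# Columns are now called `id` and `gender`, after their tags
gender_counts <- linelist::tags_df(toy_linelist) %>%
  dplyr::count(gender)

gender_counts
```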

When should I use {linelist}?

Data analysis during an outbreak response or mass-gathering surveillance demands a different set of “data safeguards” compared with usual research situations. For example, your data will change or be updated over time (e.g. new entries, new variables, renamed variables).

linelist is more appropriate for this type of ongoing or long-lasting analysis. Check the “When I should consider using {linelist}?” section of the “Get started” vignette for more information.

Key Points

  • Use the linelist package to tag, validate, and prepare case data for downstream analysis.