Validate case data

Last updated on 2026-07-23

Overview

Questions

How can raw case data be converted into a linelist object?

Objectives

Demonstrate how to convert case data into linelist data
Demonstrate how to tag and validate data to make analysis more reliable

Prerequisite

This episode requires you to:

Download the cleaned_data.csv file
Save it in the data/ folder

Introduction

In outbreak analysis, once you have completed the initial steps of reading and cleaning the case data, it’s essential to establish an additional layer that protects the integrity and reliability of subsequent analyses. Without this step, you may encounter issues later: for example, variables may be unintentionally modified or removed, or their data types (like <Date> or <character>) may change during processing. This additional layer typically involves two key steps:

tagging: Assigning a semantic role (such as case ID or date of onset) to each key column, so that its presence and expected data type can be checked.
validation: Checking that the tagged columns are present and have the correct data types, with safeguards that alert you when a tagged column is accidentally deleted or altered during later data manipulation.

This episode focuses on creating a linelist object using the linelist package, which natively supports tagging and validating outbreak data to ensure data integrity throughout the analysis workflow. Let’s start by loading the rio package to read data and the linelist package to create a linelist object. We’ll use the pipe operator (%>%) to connect some of their functions, along with others from the dplyr package. For this reason, we will also load the {tidyverse} package.

R

# Load packages
library(tidyverse) # for {dplyr} functions and the pipe %>% operator
library(rio) # for importing data
library(here) # for easy file referencing
library(linelist) # for tagging and validating

Checklist

The double-colon (`::`) operator

The :: operator in R lets you access functions or objects from a specific package without attaching the entire package to the search path. It offers several important advantages, including the following:

Telling explicitly which package a function comes from, reducing ambiguity and potential conflicts when several packages have functions with the same name
Allowing you to call a function from a package without loading the whole package with library()

For example, the command dplyr::filter(data, condition) means we are calling the filter() function from the dplyr package.
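As a minimal sketch of the same idea, the chunk below calls functions by their fully qualified names without any library() call. It uses only packages bundled with R (stats, utils, and datasets), so nothing needs to be installed; the object names values, middle, and first_rows are illustrative.

```r
# Fully qualified calls work without library(),
# because these packages ship with every R installation.
values <- c(2, 4, 9)
middle <- stats::median(values)

# The same notation also reaches datasets stored in a package
first_rows <- utils::head(datasets::mtcars, n = 3)

middle
nrow(first_rows)
```

The same notation resolves name clashes: stats::filter() and dplyr::filter() are different functions, and the prefix makes clear which one is meant.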

Import the dataset following the guidelines outlined in the Read case data episode. This involves loading the dataset into the working environment and viewing its structure and content.

R

# Read data
# e.g., if path to file is data/cleaned_data.csv then:
cleaned_data <- rio::import(
  here::here("data", "cleaned_data.csv")
) %>%
  dplyr::as_tibble() # for a simple data frame output

OUTPUT

# A tibble: 15,000 × 8
      v1 case_id   age gender status    date_onset date_sample reporting_delay
   <int>   <int> <dbl> <chr>  <chr>     <IDate>    <IDate>               <int>
 1     1   14905    90 male   confirmed 2015-03-15 2015-04-06               22
 2     2   13043    25 female <NA>      2013-09-11 2014-01-03              114
 3     3   14364    54 female <NA>      2014-02-09 2015-03-03              387
 4     4   14675    90 <NA>   <NA>      2014-10-19 2014-12-31               73
 5     5   12648    74 female <NA>      2014-06-08 2016-10-10              855
 6     6   14274    76 female <NA>      2015-04-05 2016-01-23              293
 7     7   14132    16 male   confirmed NA         2015-10-05               NA
 8     8   14715    44 female confirmed NA         2016-04-24               NA
 9     9   13435    26 male   <NA>      2014-07-09 2014-09-20               73
10    10   14816    30 female <NA>      2015-06-29 2015-02-06             -143
# ℹ 14,990 more rows

Discussion

Example scenario: an unexpected change

You are in an emergency response situation and need to generate daily situation reports. You have automated your analysis to read data directly from the online server. However, the people in charge of data collection or administration had to remove, rename, or reformat a variable you relied on!

How can you detect whether the input data is still valid for the analysis code you wrote the day before?

Creating a linelist and tagging columns

Before diving in, it helps to distinguish the two steps: tagging attaches a semantic role (such as case ID or date of onset) to a column in your dataset, while validation checks that the tagged columns still exist and have the expected data types. Tagging is done once when you build the linelist object; validation is something you can run repeatedly as the underlying data evolves.

Once the data is loaded and cleaned, we can convert the cleaned case data into a linelist object using the linelist package, as in the code chunk below.

R

# Create a linelist object from cleaned data
linelist_data <- linelist::make_linelist(
  x = cleaned_data, # Input data
  id = "case_id", # Column for unique case identifiers
  date_onset = "date_onset", # Column for date of symptom onset
  gender = "gender" # Column for gender
)

# Display the resulting linelist object
linelist_data

OUTPUT


// linelist object
# A tibble: 15,000 × 8
      v1 case_id   age gender status    date_onset date_sample reporting_delay
   <int>   <int> <dbl> <chr>  <chr>     <IDate>    <IDate>               <int>
 1     1   14905    90 male   confirmed 2015-03-15 2015-04-06               22
 2     2   13043    25 female <NA>      2013-09-11 2014-01-03              114
 3     3   14364    54 female <NA>      2014-02-09 2015-03-03              387
 4     4   14675    90 <NA>   <NA>      2014-10-19 2014-12-31               73
 5     5   12648    74 female <NA>      2014-06-08 2016-10-10              855
 6     6   14274    76 female <NA>      2015-04-05 2016-01-23              293
 7     7   14132    16 male   confirmed NA         2015-10-05               NA
 8     8   14715    44 female confirmed NA         2016-04-24               NA
 9     9   13435    26 male   <NA>      2014-07-09 2014-09-20               73
10    10   14816    30 female <NA>      2015-06-29 2015-02-06             -143
# ℹ 14,990 more rows

// tags: id:case_id, date_onset:date_onset, gender:gender

The linelist package supplies tags for common epidemiological variables and a set of appropriate data types for each. You can view the list of available tag names and their acceptable data types using the linelist::tags_types() function.

Challenge

Let’s now tag additional variables. In some datasets, variable names may not exactly match the predefined tag names. In these cases, you can map them based on how the variables were defined during data collection. You need to:

Explore the available tag names in linelist.
Find what other variables in the input dataset can be associated with any of these available tags.
Tag those variables as shown above using the linelist::make_linelist() function.

Give me a hint

You can access the list of available tag names in linelist using:

R

# Get a list of available tag names and data types
linelist::tags_types()

# Get a list of names only
linelist::tags_names()

Show me the solution

R

linelist::make_linelist(
  x = cleaned_data,
  id = "case_id",
  date_onset = "date_onset",
  gender = "gender",
  age = "age", # same name in default list and dataset
  date_reporting = "date_sample" # different names but related
)

Are the additional tags visible in the output?

Do you want to see a display of available and tagged variables? You can explore the function linelist::tags() and read its reference documentation.
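As a minimal sketch of what linelist::tags() returns, the chunk below builds a small linelist from made-up data and lists its tags. The data frame toy_cases and its values are assumptions for illustration only; they stand in for cleaned_data.

```r
library(linelist)

# Toy data standing in for cleaned_data (values are made up)
toy_cases <- data.frame(
  case_id = 1:3,
  gender = c("female", "male", "female"),
  date_onset = as.Date(c("2015-03-15", "2013-09-11", "2014-02-09")),
  stringsAsFactors = FALSE
)

toy_linelist <- make_linelist(
  toy_cases,
  id = "case_id",
  date_onset = "date_onset",
  gender = "gender"
)

# Named list mapping each tag to its column, for tagged variables only
current_tags <- tags(toy_linelist)
current_tags

# Include every available tag; untagged ones are shown as NULL
all_tags <- tags(toy_linelist, show_null = TRUE)
names(all_tags)
```

Comparing the two results shows which of the available tags are still unused in your dataset.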

Validation

Recall the scenario above, where an upstream change to the data (a removed, renamed, or reformatted variable) could quietly break your analysis. Validation is the check that catches this: running linelist::validate_linelist() confirms that every tagged column is still present and still has the expected data type. In an ongoing analysis, you can re-run it each time fresh data arrives, so that any breaking change is flagged immediately rather than propagating downstream.

To check that all tagged variables are present and have the correct data types, use the linelist::validate_linelist() function, as shown in the example below:

R

linelist::validate_linelist(linelist_data)

OUTPUT

'linelist_data' is a valid linelist object
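Returning to the scenario of an unexpected upstream change, one possible way to run this check automatically each day is to wrap tagging and validation in tryCatch(). The sketch below is an illustration under stated assumptions, not part of the linelist package: is_valid_input() is a hypothetical helper, and yesterday and today are made-up data frames in which the onset column has been renamed.

```r
library(linelist)

# Hypothetical helper: TRUE if `dat` can be tagged and validated, FALSE otherwise
is_valid_input <- function(dat) {
  tryCatch(
    {
      tagged <- make_linelist(dat, id = "case_id", date_onset = "date_onset")
      validate_linelist(tagged)
      TRUE
    },
    error = function(e) {
      message("Input data failed validation: ", conditionMessage(e))
      FALSE
    }
  )
}

# Made-up data as received yesterday
yesterday <- data.frame(
  case_id = 1:3,
  date_onset = as.Date(c("2015-03-15", "2013-09-11", "2014-02-09"))
)

# Simulate an upstream rename of the onset column
today <- yesterday
names(today)[names(today) == "date_onset"] <- "onset"

is_valid_input(yesterday)
is_valid_input(today)
```

A report script could call such a helper first and stop early, with a clear message, when the input no longer matches what the analysis expects.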

If your dataset requires a tag that is not among those defined in the linelist package, set allow_extra = TRUE when creating the linelist object with linelist::make_linelist(), and supply the corresponding data type for the new tag when validating.
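A minimal sketch of an extra tag, assuming a made-up column district that has no predefined tag. The tag name district, the toy data, and the choice of "character" as its accepted class are all assumptions for illustration; check the reference documentation of linelist::validate_linelist() and linelist::tags_types() for the arguments available in your installed version.

```r
library(linelist)

# Made-up data with a column that has no predefined tag
toy_cases <- data.frame(
  case_id = 1:3,
  district = c("north", "south", "north"),
  stringsAsFactors = FALSE
)

# `district` is not a predefined tag, so allow_extra = TRUE is needed here
toy_linelist <- make_linelist(
  toy_cases,
  id = "case_id",
  district = "district",
  allow_extra = TRUE
)

# Validation also needs the accepted class(es) of the extra tag
validated <- validate_linelist(
  toy_linelist,
  allow_extra = TRUE,
  ref_types = tags_types(district = "character", allow_extra = TRUE)
)

tags(validated)
```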

Challenge

Changes in Variable Types During Linelist Validation

Let’s assume the following scenario during an ongoing outbreak. You notice at some point that the data stream you have been relying on has a set of new entries (i.e., rows or observations), and the data type of one variable has changed.

Let’s consider the example where the type of the age variable has changed from double (<dbl>) to character (<chr>).

To simulate this situation:

Change the data type of the variable
Tag the variable into a linelist
Validate the linelist

Describe how linelist::validate_linelist() reacts when there is a change in the data type of one variable of the input data.

Give me a hint

We can use dplyr::mutate() to change the variable type before tagging for validation. For example:

R

cleaned_data %>%
  # simulate a change of data type in one variable
  dplyr::mutate(age = as.character(age)) %>%
  # tag one variable
  linelist::... %>%
  # validate the linelist
  linelist::...

Give me a hint

Please run the code line by line, focusing only on the parts before the pipe (%>%). After each step, observe the output before moving to the next line.

If the age variable changes from double (<dbl>) to character (<chr>) we get the following:

R

cleaned_data %>%
  # simulate a change of data type in one variable
  dplyr::mutate(age = as.character(age)) %>%
  # tag one variable
  linelist::make_linelist(age = "age") %>%
  # validate the linelist
  linelist::validate_linelist()

ERROR

Error:
! Some tags have the wrong class:
  - age: Must inherit from class 'numeric'/'integer', but has class 'character'

Why are we getting an Error message?

Explore other situations to understand this behavior by converting:

date_onset from <Date> to <character>
gender from <character> to <integer>

Then tag them into a linelist for validation. Does the Error message suggest a fix to the issue?

Show me the solution

R

# Change 2
# Run this code line by line to identify changes
cleaned_data %>%
  # simulate a change of data type
  dplyr::mutate(date_onset = as.character(date_onset)) %>%
  # tag
  linelist::make_linelist(date_onset = "date_onset") %>%
  # validate
  linelist::validate_linelist()

R

# Change 3
# Run this code line by line to identify changes
cleaned_data %>%
  # simulate a change of data type
  dplyr::mutate(gender = as.factor(gender)) %>%
  dplyr::mutate(gender = as.integer(gender)) %>%
  # tag
  linelist::make_linelist(gender = "gender") %>%
  # validate
  linelist::validate_linelist()

We get Error messages because the data type we assigned to each of these variables is not among the types that linelist::tags_types() accepts for the corresponding tag.

The Error message informs us that in order to validate our linelist, we must fix the input variable type to fit the expected tag type. In a data analysis script, we can do this by adding one cleaning step into the pipeline.
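A minimal sketch of such a cleaning step, assuming age arrived as text. The data frame incoming and its values are made up for illustration; the conversion with as.numeric() is one possible fix, not the only one.

```r
library(linelist)
library(dplyr)

# Made-up data where age arrived as character
incoming <- data.frame(
  case_id = 1:3,
  age = c("34", "51", "7"),
  stringsAsFactors = FALSE
)

validated <- incoming %>%
  # cleaning step: restore a type that the age tag accepts
  dplyr::mutate(age = as.numeric(age)) %>%
  # tag
  linelist::make_linelist(id = "case_id", age = "age") %>%
  # validate
  linelist::validate_linelist()

class(validated$age)
```

Note that as.numeric() turns entries that cannot be read as numbers into NA with a warning, so it is worth inspecting the converted column before relying on it.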

Checklist

So far, a typical workflow looks like this:

R

# use cleaned data
cleaned_data %>%
  # tag as many variables as possible
  # creates the <linelist> class object
  linelist::make_linelist(
    id = "case_id",
    date_onset = "date_onset",
    gender = "gender"
  ) %>%
  # validate the linelist
  linelist::validate_linelist()

OUTPUT

'.' is a valid linelist object

Safeguarding

Safeguarding is built into linelist objects. If you try to drop any of the tagged columns, you will receive a warning or an error message, as shown in the example below.

R

new_df <- linelist_data %>%
  dplyr::select(case_id, gender)

WARNING

Warning: The following tags have lost their variable:
 date_onset:date_onset

The Warning message above is the default output option when we lose tags in a linelist object. However, it can be changed to an Error message using the linelist::lost_tags_action() function.

Challenge

Exploring Safeguarding Behavior for Lost Tags

Let’s test the implications of changing the safeguarding configuration from a Warning to an Error message.

First, run this code to count the frequency of each category within a categorical variable:

R

linelist_data %>%
  dplyr::select(case_id, gender) %>%
  dplyr::count(gender)

Set the behavior for lost tags in a linelist to “error” as follows:

R

# set behavior to "error"
linelist::lost_tags_action(action = "error")

Now, re-run the above code chunk with dplyr::count().

Identify:

What is the difference in the output between a Warning and an Error?
What could be the implications of this change for your daily data analysis pipeline during an outbreak response?

Show me the solution

The choice between a Warning and an Error depends on how much attention or flexibility you need when tags are lost. A Warning alerts you to the change but lets the downstream code keep running. An Error stops your analysis pipeline, and the remaining code is not executed.

A data reading, cleaning and validation script may require a more stable or fixed pipeline. An exploratory data analysis may require a more flexible approach. These two processes can be isolated in different scripts or repositories to adjust the safeguarding according to your needs.

Before you continue, set the configuration back to the default option of Warning:

R

# set behavior to the default option: "warning"
linelist::lost_tags_action()

OUTPUT

Lost tags will now issue a warning.
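The difference between the two settings can also be handled programmatically. The sketch below, built on made-up data, switches to "error", catches the resulting error with tryCatch(), and then restores the default. The object names toy_linelist and outcome are illustrative, and the use of quiet = TRUE and linelist::get_lost_tags_action() assumes a version of linelist that provides them.

```r
library(linelist)
library(dplyr)

# Made-up linelist with two tagged columns
toy_linelist <- data.frame(
  case_id = 1:3,
  gender = c("female", "male", "female"),
  stringsAsFactors = FALSE
) %>%
  linelist::make_linelist(id = "case_id", gender = "gender")

# Stricter setting for this chunk only
linelist::lost_tags_action(action = "error", quiet = TRUE)

outcome <- tryCatch(
  {
    # dropping case_id loses the id tag, which now raises an error
    toy_linelist %>% dplyr::select(gender)
    "completed"
  },
  error = function(e) "stopped"
)

# Restore the default so later code behaves as before
linelist::lost_tags_action(action = "warning", quiet = TRUE)

outcome
linelist::get_lost_tags_action()
```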

A linelist object resembles a data frame but offers richer features and functionalities. Packages that are linelist-aware can leverage these features. For example, you can extract a data frame of only the tagged columns using the linelist::tags_df() function, as shown below:

R

linelist::tags_df(linelist_data)

OUTPUT

# A tibble: 15,000 × 3
      id date_onset gender
   <int> <IDate>    <chr>
 1 14905 2015-03-15 male
 2 13043 2013-09-11 female
 3 14364 2014-02-09 female
 4 14675 2014-10-19 <NA>
 5 12648 2014-06-08 female
 6 14274 2015-04-05 female
 7 14132 NA         male
 8 14715 NA         female
 9 13435 2014-07-09 male
10 14816 2015-06-29 female
# ℹ 14,990 more rows

This allows for the use of tagged variables only in downstream analysis, which will be useful for the next episode (Aggregate and visualize)!

A single-chunk version of all the steps learned in this episode is available in the spoiler below.

Show details

You can do all these steps connected in a single pipe:

R

# use cleaned data
cleaned_data %>%
  # tag as many variables as possible
  # creates the <linelist> class object
  linelist::make_linelist(
    id = "case_id",
    date_onset = "date_onset",
    gender = "gender"
  ) %>%
  # validate the linelist
  linelist::validate_linelist() %>%
  # extract a df with standard column names
  linelist::tags_df()

OUTPUT

'.' is a valid linelist object

OUTPUT

# A tibble: 15,000 × 3
      id date_onset gender
   <int> <IDate>    <chr>
 1 14905 2015-03-15 male
 2 13043 2013-09-11 female
 3 14364 2014-02-09 female
 4 14675 2014-10-19 <NA>
 5 12648 2014-06-08 female
 6 14274 2015-04-05 female
 7 14132 NA         male
 8 14715 NA         female
 9 13435 2014-07-09 male
10 14816 2015-06-29 female
# ℹ 14,990 more rows

Checklist

When should I use `{linelist}`?

Data analysis during an outbreak response or mass-gathering surveillance demands a different set of data safeguards compared with usual research situations. For example, your data will change or be updated over time (e.g., new entries, new variables, renamed variables).

linelist is well suited to this type of ongoing or long-running analysis. See the section “When I should consider using {linelist}?” of the “Get started” vignette for more information.

Key Points

Use the linelist package to tag, validate, and prepare case data for downstream analysis.
Explore and map dataset variables to predefined tags for standardization.
Understand how Warnings vs. Errors affect the data processing workflow.
