Clean and validate

Last updated on 2024-04-29

Estimated time: 30 minutes

Overview

Questions

  • How to clean and standardize case data?
  • How to convert a raw dataset into a linelist object?

Objectives

  • Explain how to clean, curate, and standardize case data using the {cleanepi} package
  • Demonstrate how to convert case data to linelist data

Introduction


In the process of analyzing outbreak data, it’s essential to ensure that the dataset is clean, curated, standardized, and validated to facilitate accurate and reproducible analysis. This episode focuses on cleaning epidemic and outbreak data using the {cleanepi} package, and validating it using the {linelist} package. For demonstration purposes, we’ll work with a simulated dataset of Ebola cases.

The first step is to import the dataset following the guidelines outlined in the Read case data episode. This involves loading the dataset into our environment and viewing its structure and content.

R

# Load packages
library("rio")
library("here")

# Read data
# e.g.: if path to file is data/raw-data/simulated_ebola_2.csv then:
raw_ebola_data <- rio::import(
  here::here("data", "raw-data", "simulated_ebola_2.csv")
)

R

# Return first five rows
utils::head(raw_ebola_data, 5)

OUTPUT

  V1 case id         age gender    status date onset date sample
1  1   14905          90      1 confirmed 03/15/2015  06/04/2015
2  2   13043 twenty-five      2            Sep /11/Y  03/01/2014
3  3   14364          54      f      <NA> 09/02/2014  03/03/2015
4  4   14675      ninety   <NA>           10/19/2014  31/ 12 /14
5  5   12648          74      F           08/06/2014  10/10/2016

A quick inspection


Quick exploration and inspection of the dataset are crucial before diving into any analysis tasks. The {cleanepi} package simplifies this process with the scan_data() function. Let’s take a look at how you can use it:

R

library("cleanepi")
cleanepi::scan_data(raw_ebola_data)

OUTPUT

  Field_names  missing numeric     date character logical
1          V1 0.000000  1.0000 0.000000  0.000000       0
2     case id 0.000000  1.0000 0.000000  0.000000       0
3         age 0.064600  0.8348 0.000000  0.100600       0
4      gender 0.157867  0.0472 0.000000  0.794933       0
5      status 0.053533  0.0000 0.000000  0.946467       0
6  date onset 0.000067  0.0000 0.915733  0.084200       0
7 date sample 0.000133  0.0000 0.999867  0.000000       0

The results provide an overview of the content of every column, including column names and the percentage of each data type per column. You can see that the column names in the dataset are descriptive but lack consistency, as some of them are composed of multiple words separated by white spaces. Additionally, some columns contain more than one data type, and there are missing values in others.

Common operations


This section demonstrates how to perform some common data cleaning operations using the {cleanepi} package.

Standardizing column names

For this example dataset, standardizing column names typically involves removing spaces and connecting different words with “_”. This practice helps maintain consistency and readability in the dataset. However, the function used for standardizing column names offers more options. Type ?cleanepi::standardize_column_names for more details.

R

sim_ebola_data <- cleanepi::standardize_column_names(raw_ebola_data)
names(sim_ebola_data)

OUTPUT

[1] "v_1"         "case_id"     "age"         "gender"      "status"     
[6] "date_onset"  "date_sample"

Challenge

  • What differences can you observe in the column names?

If you want to maintain certain column names without subjecting them to the standardization process, you can utilize the keep parameter of the standardize_column_names() function. This parameter accepts a vector of column names that are intended to be kept unchanged.

Exercise: Standardize the column names of the input dataset, but keep the “V1” column as is.

Removing irregularities

Raw data may contain irregularities such as duplicated and empty rows and columns, as well as constant columns. The remove_constant() and remove_duplicates() functions from {cleanepi} remove such irregularities, as demonstrated in the code chunk below.

R

sim_ebola_data <- cleanepi::remove_constant(sim_ebola_data)
sim_ebola_data <- cleanepi::remove_duplicates(sim_ebola_data)

Note that our simulated Ebola dataset does not contain duplicated or constant rows or columns.

Replacing missing values

In addition to these irregularities, raw data can contain missing values that may be encoded by different strings, including the empty string. To ensure robust analysis, it is good practice to replace all missing values with NA across the entire dataset. Below is a code snippet demonstrating how you can achieve this in {cleanepi}:

R

sim_ebola_data <- cleanepi::replace_missing_values(sim_ebola_data)

Validating subject IDs

Each entry in the dataset represents a subject and should be distinguishable by a specific column formatted in a particular way, such as falling within a specified range, containing certain prefixes and/or suffixes, or containing a specific number of characters. The {cleanepi} package offers the check_subject_ids() function designed precisely for this task, as shown in the code chunk below. This function validates whether the subject IDs are unique and meet the required criteria.

R

# remove this chunk code once {cleanepi} is updated.
# The coercion made here will be accounted for within {cleanepi}
sim_ebola_data$case_id <- as.character(sim_ebola_data$case_id)

R

sim_ebola_data <- cleanepi::check_subject_ids(sim_ebola_data,
  target_columns = "case_id",
  range = c(0, 15000)
)

OUTPUT

Found 1957 duplicated rows. Please consult the report for more details.

Note that our simulated dataset does contain duplicated subject IDs.

Standardizing dates

An epidemic dataset typically contains date columns for different events, such as the date of infection, the date of symptom onset, etc. These dates can come in different formats, and it is good practice to standardize them. The {cleanepi} package provides functionality for converting date columns in epidemic datasets into ISO format, ensuring consistency across the different date columns. Here’s how you can use it on our simulated dataset:

R

sim_ebola_data <- cleanepi::standardize_dates(
  sim_ebola_data,
  target_columns = c(
    "date_onset",
    "date_sample"
  )
)

utils::head(sim_ebola_data)

OUTPUT

  v_1 case_id         age gender    status date_onset date_sample
1   1   14905          90      1 confirmed 2015-03-15  2015-04-06
2   2   13043 twenty-five      2      <NA>       <NA>  2014-01-03
3   3   14364          54      f      <NA> 2014-02-09  2015-03-03
4   4   14675      ninety   <NA>      <NA> 2014-10-19  2014-12-31
5   5   12648          74      F      <NA> 2014-06-08  2016-10-10
6   6   14274 seventy-six female      <NA>       <NA>  2016-01-23

This function converts the values in the target columns, or automatically detects the date columns within the dataset (if target_columns = NULL), and converts them into the Ymd (ISO 8601) format.

Converting to numeric values

In the raw dataset, some columns can contain a mixture of character and numeric values, and you may want to convert the character values explicitly into numeric. For example, in our simulated dataset, some entries in the age column are written in words. The convert_to_numeric() function in {cleanepi} performs such conversion, as illustrated in the code chunk below.

R

sim_ebola_data <- cleanepi::convert_to_numeric(sim_ebola_data,
  target_columns = "age"
)
utils::head(sim_ebola_data)

OUTPUT

  v_1 case_id age gender    status date_onset date_sample
1   1   14905  90      1 confirmed 2015-03-15  2015-04-06
2   2   13043  25      2      <NA>       <NA>  2014-01-03
3   3   14364  54      f      <NA> 2014-02-09  2015-03-03
4   4   14675  90   <NA>      <NA> 2014-10-19  2014-12-31
5   5   12648  74      F      <NA> 2014-06-08  2016-10-10
6   6   14274  76 female      <NA>       <NA>  2016-01-23

Multiple operations at once


Performing data cleaning operations individually can be time-consuming and error-prone. The {cleanepi} package simplifies this process by offering a convenient wrapper function called clean_data(), which allows you to perform multiple operations at once.

The clean_data() function applies a series of predefined data cleaning operations to the input dataset.
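As a rough sketch of how this wrapper can be called, the individual operations are passed as a named list. Note that the exact names and structure of the elements accepted in params are assumptions based on the individual cleaning functions and may differ between {cleanepi} versions; check ?cleanepi::clean_data for the authoritative interface.

```r
# A sketch of calling the clean_data() wrapper; the list element
# names below are assumptions mirroring the individual cleaning
# functions and may differ between {cleanepi} versions
cleaned_data <- cleanepi::clean_data(
  raw_ebola_data,
  params = list(
    standardize_column_names = list(keep = "V1"),
    replace_missing_values   = list(target_columns = NULL),
    remove_duplicates        = list(target_columns = NULL),
    standardize_dates        = list(
      target_columns  = c("date_onset", "date_sample"),
      error_tolerance = 0.4
    ),
    to_numeric = list(target_columns = "age")
  )
)
```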

Alternatively, you can chain multiple data cleaning tasks via the pipe operator |>, as shown in the code snippet below.

R

# remove the line below once {cleanepi} is updated.
# The coercion made here will be accounted for within {cleanepi}
raw_ebola_data$`case id` <- as.character(raw_ebola_data$`case id`)
# PERFORM THE OPERATIONS USING THE pipe SYNTAX
cleaned_data <- raw_ebola_data |>
  cleanepi::standardize_column_names(keep = "V1", rename = NULL) |>
  cleanepi::replace_missing_values(target_columns = NULL) |>
  cleanepi::remove_constant(cutoff = 1.0) |>
  cleanepi::remove_duplicates(target_columns = NULL) |>
  cleanepi::standardize_dates(
    target_columns = c("date_onset", "date_sample"),
    error_tolerance = 0.4,
    format = NULL,
    timeframe = NULL
  ) |>
  cleanepi::check_subject_ids(
    target_columns = "case_id",
    range = c(1, 15000)
  ) |>
  cleanepi::convert_to_numeric(target_columns = "age") |>
  cleanepi::clean_using_dictionary(dictionary = test_dict)

OUTPUT

Found 1957 duplicated rows. Please consult the report for more details.

Printing the clean report


The {cleanepi} package generates a comprehensive report detailing the findings and actions of all data cleansing operations conducted during the analysis. This report is presented as a webpage with multiple sections. Each section corresponds to a specific data cleansing operation, and clicking on each section allows you to access the results of that particular operation. This interactive approach enables users to efficiently review and analyze the outcomes of individual cleansing steps within the broader data cleansing process.

You can view the report using the cleanepi::print_report() function.
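For example, after running the cleaning pipeline above, the report attached to the cleaned dataset can be displayed with:

```r
# Display the HTML report of all cleaning operations performed on
# the dataset; this opens in your browser or RStudio viewer pane
cleanepi::print_report(cleaned_data)
```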

Example of data cleaning report generated by {cleanepi}

Validating and tagging case data


In outbreak analysis, once you have completed the initial steps of reading and cleaning the case data, it’s essential to establish an additional foundational layer to ensure the integrity and reliability of subsequent analyses. Specifically, this involves verifying the presence and correct data type of certain input columns within your dataset, a process commonly referred to as “tagging.” Additionally, it’s crucial to implement measures to validate that these tagged columns are not inadvertently deleted during further data processing steps.

This is achieved by converting the cleaned case data into a linelist object using the {linelist} package, as shown in the code chunk below.

R

library("linelist")
data <- linelist::make_linelist(cleaned_data,
  id = "case_id",
  age = "age",
  date_onset = "date_onset",
  date_reporting = "date_sample",
  gender = "gender"
)
utils::head(data, 7)

OUTPUT


// linelist object
  V1 case_id age gender    status date_onset date_sample
1  1   14905  90   male confirmed 2015-03-15  2015-04-06
2  2   13043  25 female      <NA>       <NA>  2014-01-03
3  3   14364  54 female      <NA> 2014-02-09  2015-03-03
4  4   14675  90   <NA>      <NA> 2014-10-19  2014-12-31
5  5   12648  74 female      <NA> 2014-06-08  2016-10-10
6  6   14274  76 female      <NA>       <NA>  2016-01-23
7  7   14132  16   male confirmed       <NA>  2015-10-05

// tags: id:case_id, date_onset:date_onset, date_reporting:date_sample, gender:gender, age:age 

Key Points

  • Use the {cleanepi} package to clean and standardize epidemic and outbreak data.
  • Use {linelist} to tag, validate, and prepare case data for downstream analysis.