Clean and validate
Last updated on 2024-04-29
Estimated time: 30 minutes
Overview
Questions
- How to clean and standardize case data?
- How to convert a raw dataset into a linelist object?
Objectives
- Explain how to clean, curate, and standardize case data using the {cleanepi} package
- Demonstrate how to convert case data to linelist data
Introduction
In the process of analyzing outbreak data, it’s essential to ensure that the dataset is clean, curated, standardized, and validated to facilitate accurate and reproducible analysis. This episode focuses on cleaning epidemic and outbreak data using the cleanepi package, and validating it using the linelist package. For demonstration purposes, we’ll work with a simulated dataset of Ebola cases.
The first step is to import the dataset following the guidelines outlined in the Read case data episode. This involves loading the dataset into our environment and viewing its structure and content.
R
# Load packages
library("rio")
library("here")
# Read data
# e.g.: if path to file is data/raw-data/simulated_ebola_2.csv then:
raw_ebola_data <- rio::import(
  here::here("data", "raw-data", "simulated_ebola_2.csv")
)
R
# Return first five rows
utils::head(raw_ebola_data, 5)
OUTPUT
  V1 case id         age gender    status date onset date sample
1  1   14905          90      1 confirmed 03/15/2015  06/04/2015
2  2   13043 twenty-five      2            Sep /11/Y  03/01/2014
3  3   14364          54      f      <NA> 09/02/2014  03/03/2015
4  4   14675      ninety   <NA>           10/19/2014  31/ 12 /14
5  5   12648          74      F           08/06/2014  10/10/2016
A quick inspection
Quick exploration and inspection of the dataset are crucial before
diving into any analysis tasks. The {cleanepi} package simplifies this
process with the scan_data() function.
Let’s take a look at how you can use it:
R
library("cleanepi")
cleanepi::scan_data(raw_ebola_data)
OUTPUT
  Field_names  missing numeric     date character logical
1          V1 0.000000  1.0000 0.000000  0.000000       0
2     case id 0.000000  1.0000 0.000000  0.000000       0
3         age 0.064600  0.8348 0.000000  0.100600       0
4      gender 0.157867  0.0472 0.000000  0.794933       0
5      status 0.053533  0.0000 0.000000  0.946467       0
6  date onset 0.000067  0.0000 0.915733  0.084200       0
7 date sample 0.000133  0.0000 0.999867  0.000000       0
The results provide an overview of the content of every column, including the column names and the proportion of values of each data type per column. You can see that the column names in the dataset are descriptive but lack consistency, as some of them are composed of multiple words separated by white spaces. Additionally, some columns contain more than one data type, and there are missing values in others.
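To make the meaning of these proportions concrete, here is a minimal base R sketch that computes the share of missing and numeric-like values per column on a small made-up data frame. It is only an illustration of the idea, not how scan_data() is implemented; the toy data and the summarise_column() helper are assumptions introduced for this example.

```r
# Toy data frame mimicking two columns of the raw dataset (values are made up)
toy <- data.frame(
  age = c("90", "twenty-five", NA, "54"),
  status = c("confirmed", NA, "suspected", "confirmed"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: proportion of missing and numeric-like values in a column
summarise_column <- function(x) {
  x <- as.character(x)
  is_missing <- is.na(x)
  # suppressWarnings(): non-numeric strings are coerced to NA on purpose
  is_numeric <- !is_missing & !is.na(suppressWarnings(as.numeric(x)))
  c(missing = mean(is_missing), numeric = mean(is_numeric))
}

# One row per column, one column per summary
column_summary <- t(sapply(toy, summarise_column))
print(column_summary)
```

In this toy example, half of the age entries are numeric-like and a quarter are missing, which is the kind of signal that tells you a column needs cleaning.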
Common operations
This section demonstrates how to perform some common data cleaning
operations using the {cleanepi} package.
Standardizing column names
For this example dataset, standardizing column names typically
involves removing spaces and connecting different words with “_”. This
practice helps maintain consistency and readability in the dataset.
However, the function used for standardizing column names offers more
options. Type ?cleanepi::standardize_column_names for more details.
R
sim_ebola_data <- cleanepi::standardize_column_names(raw_ebola_data)
names(sim_ebola_data)
OUTPUT
[1] "v_1" "case_id" "age" "gender" "status"
[6] "date_onset" "date_sample"
If you want to maintain certain column names without subjecting them
to the standardization process, you can use the keep parameter of the
standardize_column_names() function. This parameter accepts a vector of
column names that are intended to be kept unchanged.
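As a rough illustration of what standardizing names while protecting some of them involves, the base R sketch below lowercases names, replaces runs of non-alphanumeric characters with "_", and leaves the names listed in keep untouched. The helper standardise_names_sketch() is hypothetical and deliberately simplified; it does not reproduce the exact rules of {cleanepi} (for instance, it would turn "V1" into "v1" rather than "v_1").

```r
# Simplified, hypothetical helper: NOT the {cleanepi} implementation
standardise_names_sketch <- function(x, keep = NULL) {
  new <- tolower(x)
  new <- gsub("[^a-z0-9]+", "_", new)  # runs of non-alphanumerics become "_"
  new <- gsub("^_+|_+$", "", new)      # drop leading/trailing underscores
  protected <- x %in% keep
  new[protected] <- x[protected]       # leave protected names untouched
  new
}

raw_names <- c("V1", "case id", "age", "date onset", "date sample")
clean_names <- standardise_names_sketch(raw_names, keep = "V1")
print(clean_names)
# "V1" "case_id" "age" "date_onset" "date_sample"
```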
Exercise: Standardize the column names of the input dataset, but keep the “V1” column as is.
Removing irregularities
Raw data may contain irregularities such as duplicated or empty rows
and columns, as well as constant columns. The remove_constant() and
remove_duplicates() functions from {cleanepi} remove such irregularities,
as demonstrated in the code chunk below.
R
sim_ebola_data <- cleanepi::remove_constant(sim_ebola_data)
sim_ebola_data <- cleanepi::remove_duplicates(sim_ebola_data)
Note that our simulated Ebola dataset contains neither duplicated nor constant rows or columns.
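Because the simulated dataset has none of these irregularities, it can help to see them on a tiny made-up example. The following base R sketch removes empty rows and columns, constant columns, and exact duplicate rows. It only illustrates the concepts and is not the {cleanepi} implementation; the toy data frame is an assumption made for this example.

```r
# Made-up data with one empty row, one empty column,
# one constant column, and one duplicated row
toy <- data.frame(
  id = c(1, 2, 2, NA),
  country = c("A", "A", "A", NA),
  age = c(30, 41, 41, NA),
  notes = c(NA, NA, NA, NA),
  stringsAsFactors = FALSE
)

# 1. Drop rows and columns that contain only missing values
not_missing <- !is.na(toy)
toy <- toy[rowSums(not_missing) > 0, colSums(not_missing) > 0, drop = FALSE]

# 2. Drop columns that hold a single distinct value
is_constant <- vapply(toy, function(x) length(unique(x)) == 1, logical(1))
toy <- toy[, !is_constant, drop = FALSE]

# 3. Drop exact duplicate rows
toy <- toy[!duplicated(toy), , drop = FALSE]

print(toy)
```

Only the id and age columns and two distinct rows remain after the three steps.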
Replacing missing values
In addition to these irregularities, raw data can contain missing values
that may be encoded by different strings, including the empty string. To
ensure robust analysis, it is good practice to replace all missing values
by NA in the entire dataset. Below is a code snippet demonstrating how
you can achieve this in {cleanepi}:
R
sim_ebola_data <- cleanepi::replace_missing_values(sim_ebola_data)
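To illustrate the idea on a small scale, the base R sketch below replaces a handful of strings that stand for "missing" with NA across every column. The set of strings in na_strings and the toy data frame are assumptions chosen for this example; they are not the list that {cleanepi} uses.

```r
# Made-up data where missing values are encoded in different ways
toy <- data.frame(
  status = c("confirmed", "", "-99", "suspected"),
  gender = c("f", "NA", "m", ""),
  stringsAsFactors = FALSE
)

# Strings treated as missing: an assumption for this sketch
na_strings <- c("", "NA", "-99")

# Replace those strings by NA in every column, keeping the data frame shape
toy[] <- lapply(toy, function(x) {
  x[x %in% na_strings] <- NA
  x
})

print(toy)
```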
Validating subject IDs
Each entry in the dataset represents a subject and should be
distinguishable by a specific column formatted in a particular way, such
as falling within a specified range, containing certain prefixes and/or
suffixes, or containing a specific number of characters. The {cleanepi}
package offers the check_subject_ids() function, designed precisely for
this task, as shown in the code chunk below. This function checks whether
the IDs are unique and meet the required criteria.
R
# Temporary workaround: remove this code chunk once {cleanepi} is updated.
# The coercion made here will then be accounted for within {cleanepi}
sim_ebola_data$case_id <- as.character(sim_ebola_data$case_id)
R
sim_ebola_data <- cleanepi::check_subject_ids(
  sim_ebola_data,
  target_columns = "case_id",
  range = c(0, 15000)
)
OUTPUT
Found 1957 duplicated rows. Please consult the report for more details.
Note that our simulated dataset does contain duplicated subject IDs.
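The kinds of checks described above (missing, duplicated, or out-of-range IDs) can be sketched in base R as follows. This is a simplified illustration on made-up IDs, not the logic of check_subject_ids() itself; the id_report data frame and its column names are assumptions introduced for this example.

```r
# Made-up subject IDs: one duplicated pair, one out of range, one missing
ids <- c("14905", "13043", "14905", "20001", NA)
id_range <- c(0, 15000)

# suppressWarnings(): non-numeric IDs would be coerced to NA on purpose
id_num <- suppressWarnings(as.numeric(ids))

is_missing <- is.na(ids)
# Flag every member of a duplicated group, not only the later occurrences
is_duplicated <- !is_missing &
  (duplicated(ids) | duplicated(ids, fromLast = TRUE))
out_of_range <- !is_missing & !is.na(id_num) &
  (id_num < id_range[1] | id_num > id_range[2])

id_report <- data.frame(id = ids, is_missing, is_duplicated, out_of_range)
print(id_report)
```

A report like this only flags the problematic IDs; deciding whether to correct or drop the affected rows remains a judgment call for the analyst.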
Standardizing dates
An epidemic dataset typically contains date columns for different
events, such as the date of infection or the date of symptom onset.
These dates can come in different formats, and it is good practice to
standardize them. The {cleanepi} package provides functionality for
converting date columns in epidemic datasets into ISO format, ensuring
consistency across the different date columns. Here’s how you can use it
on our simulated dataset:
R
sim_ebola_data <- cleanepi::standardize_dates(
  sim_ebola_data,
  target_columns = c("date_onset", "date_sample")
)
utils::head(sim_ebola_data)
OUTPUT
v_1 case_id age gender status date_onset date_sample
1 1 14905 90 1 confirmed 2015-03-15 2015-04-06
2 2 13043 twenty-five 2 <NA> <NA> 2014-01-03
3 3 14364 54 f <NA> 2014-02-09 2015-03-03
4 4 14675 ninety <NA> <NA> 2014-10-19 2014-12-31
5 5 12648 74 F <NA> 2014-06-08 2016-10-10
6 6 14274 seventy-six female <NA> <NA> 2016-01-23
This function converts the values in the target columns into the Ymd
format (for example, 2015-03-15). If target_columns = NULL, it
automatically detects the date columns within the dataset and converts
them.
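To show the general idea of unifying mixed date formats, the base R sketch below tries a list of candidate formats in turn and keeps the first one that parses each value. It is a deliberately naive illustration, not how standardize_dates() works: the parse_mixed_dates() helper and its list of formats are assumptions, and ambiguous values such as 06/04/2015 are resolved simply by the order of the formats.

```r
# Hypothetical helper: try several formats, keep the first successful parse
parse_mixed_dates <- function(x,
                              formats = c("%m/%d/%Y", "%d/%m/%Y", "%Y-%m-%d")) {
  out <- rep(as.Date(NA), length(x))
  for (fmt in formats) {
    # Only retry values that are still unparsed
    todo <- is.na(out) & !is.na(x)
    out[todo] <- as.Date(x[todo], format = fmt)
  }
  out
}

raw_dates <- c("03/15/2015", "31/12/2014", "2016-10-10", "Sep /11/Y")
iso_dates <- parse_mixed_dates(raw_dates)
print(iso_dates)
```

Values that match none of the formats, such as the malformed last entry, are returned as NA rather than guessed.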
Converting to numeric values
In the raw dataset, some columns can contain a mixture of character
and numerical values, and you may want to convert the character values
explicitly into numeric. For example, in our simulated dataset, some
entries in the age column are written in words. The convert_to_numeric()
function in {cleanepi} performs such conversion, as illustrated in the
code chunk below.
R
sim_ebola_data <- cleanepi::convert_to_numeric(
  sim_ebola_data,
  target_columns = "age"
)
utils::head(sim_ebola_data)
OUTPUT
v_1 case_id age gender status date_onset date_sample
1 1 14905 90 1 confirmed 2015-03-15 2015-04-06
2 2 13043 25 2 <NA> <NA> 2014-01-03
3 3 14364 54 f <NA> 2014-02-09 2015-03-03
4 4 14675 90 <NA> <NA> 2014-10-19 2014-12-31
5 5 12648 74 F <NA> 2014-06-08 2016-10-10
6 6 14274 76 female <NA> <NA> 2016-01-23
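For intuition, a conversion of this kind can be sketched in base R with a small lookup table: entries that are already digits are converted directly, and the remaining ones are matched against known number words. The word_values table and the to_numeric_sketch() helper are assumptions that cover only the words in this toy example; this is not the method used by convert_to_numeric().

```r
# Small lookup table: covers only the words used in this toy example
word_values <- c(
  "sixteen" = 16, "twenty-five" = 25, "seventy-six" = 76, "ninety" = 90
)

# Hypothetical helper: digits convert directly, known words via the lookup
to_numeric_sketch <- function(x) {
  # suppressWarnings(): words are coerced to NA here, then handled below
  out <- suppressWarnings(as.numeric(x))
  is_word <- is.na(out) & !is.na(x)
  out[is_word] <- unname(word_values[tolower(trimws(x[is_word]))])
  out
}

raw_age <- c("90", "twenty-five", "54", "ninety", NA, "unknown")
age_num <- to_numeric_sketch(raw_age)
print(age_num)
# 90 25 54 90 NA NA
```

Words that are not in the lookup table end up as NA, so unexpected entries stay visible instead of being silently mis-converted.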
Multiple operations at once
Performing data cleaning operations individually can be
time-consuming and error-prone. The {cleanepi} package simplifies this
process by offering a convenient wrapper function called clean_data(),
which allows you to perform multiple operations at once.
The clean_data() function applies a series of predefined data cleaning
operations to the input dataset. Type ?cleanepi::clean_data for more
details.
Furthermore, you can combine multiple data cleaning tasks via the pipe operator (|>), as shown in the code snippet below.
R
# remove the line below once {cleanepi} is updated
raw_ebola_data$`case id` <- as.character(raw_ebola_data$`case id`)
# NOTE: `test_dict` (used in the last step) is a data dictionary that must
# already exist in your session; it is not created in this chunk
# Perform the operations using the pipe syntax
cleaned_data <- raw_ebola_data |>
  cleanepi::standardize_column_names(keep = "V1", rename = NULL) |>
  cleanepi::replace_missing_values(target_columns = NULL) |>
  cleanepi::remove_constant(cutoff = 1.0) |>
  cleanepi::remove_duplicates(target_columns = NULL) |>
  cleanepi::standardize_dates(
    target_columns = c("date_onset", "date_sample"),
    error_tolerance = 0.4,
    format = NULL,
    timeframe = NULL
  ) |>
  cleanepi::check_subject_ids(
    target_columns = "case_id",
    range = c(1, 15000)
  ) |>
  cleanepi::convert_to_numeric(target_columns = "age") |>
  cleanepi::clean_using_dictionary(dictionary = test_dict)
OUTPUT
Found 1957 duplicated rows. Please consult the report for more details.
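The last step of the pipeline relies on a data dictionary, test_dict, which is not defined in this episode. To show what dictionary-based recoding means in general, here is a minimal base R sketch that maps coded gender entries to labels. The gender_dict table below is hypothetical and is not the actual test_dict object or the exact dictionary structure expected by {cleanepi}.

```r
# Hypothetical dictionary: coded entries and the label each should become
gender_dict <- data.frame(
  options = c("1", "2", "m", "f", "M", "F"),
  values = c("male", "female", "male", "female", "male", "female"),
  stringsAsFactors = FALSE
)

# Made-up column with mixed encodings
gender <- c("1", "2", "f", NA, "F", "female")

# Named lookup vector: names are the coded entries, values are the labels
lookup <- setNames(gender_dict$values, gender_dict$options)

# Recode only the entries found in the dictionary; leave the rest untouched
matched <- gender %in% gender_dict$options
gender[matched] <- unname(lookup[gender[matched]])

print(gender)
# "male" "female" "female" NA "female" "female"
```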
Printing the clean report
The {cleanepi} package generates a comprehensive report detailing the
findings and actions of all data cleansing operations conducted during
the analysis. This report is presented as a webpage with multiple
sections. Each section corresponds to a specific data cleansing
operation, and clicking on each section allows you to access the results
of that particular operation. This interactive approach enables users to
efficiently review and analyze the outcomes of individual cleansing steps
within the broader data cleansing process.
You can view the report using the cleanepi::print_report() function.
Validating and tagging case data
In outbreak analysis, once you have completed the initial steps of reading and cleaning the case data, it’s essential to establish an additional foundational layer to ensure the integrity and reliability of subsequent analyses. Specifically, this involves verifying the presence and correct data type of certain input columns within your dataset, a process commonly referred to as “tagging.” Additionally, it’s crucial to implement measures to validate that these tagged columns are not inadvertently deleted during further data processing steps.
This is achieved by converting the cleaned case data into a linelist
object using the linelist package, as shown in the code chunk below.
R
library("linelist")
data <- linelist::make_linelist(
  cleaned_data,
  id = "case_id",
  age = "age",
  date_onset = "date_onset",
  date_reporting = "date_sample",
  gender = "gender"
)
utils::head(data, 7)
OUTPUT
// linelist object
V1 case_id age gender status date_onset date_sample
1 1 14905 90 male confirmed 2015-03-15 2015-04-06
2 2 13043 25 female <NA> <NA> 2014-01-03
3 3 14364 54 female <NA> 2014-02-09 2015-03-03
4 4 14675 90 <NA> <NA> 2014-10-19 2014-12-31
5 5 12648 74 female <NA> 2014-06-08 2016-10-10
6 6 14274 76 female <NA> <NA> 2016-01-23
7 7 14132 16 male confirmed <NA> 2015-10-05
// tags: id:case_id, date_onset:date_onset, date_reporting:date_sample, gender:gender, age:age
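To make the idea of validating tagged columns more tangible, the base R sketch below checks that each tagged column is present and has an accepted type, and stops with an error otherwise. It is a conceptual illustration only, not how the linelist package performs its checks; the check_tags() helper, the toy data, and the accepted classes are all assumptions made for this example.

```r
# Made-up cleaned data
toy <- data.frame(
  case_id = c("14905", "13043"),
  age = c(90, 25),
  date_onset = as.Date(c("2015-03-15", NA)),
  stringsAsFactors = FALSE
)

# Tag name -> column name, plus accepted classes (assumptions for this sketch)
tags <- c(id = "case_id", age = "age", date_onset = "date_onset")
accepted <- list(
  id = c("character", "numeric"),
  age = "numeric",
  date_onset = "Date"
)

# Hypothetical helper: stop if a tagged column is missing or has a wrong type
check_tags <- function(x, tags, accepted) {
  for (tag in names(tags)) {
    col <- tags[[tag]]
    if (!col %in% names(x)) {
      stop("tagged column missing: ", col)
    }
    if (!inherits(x[[col]], accepted[[tag]])) {
      stop("unexpected type in tagged column: ", col)
    }
  }
  invisible(TRUE)
}

check_tags(toy, tags, accepted)

# Dropping a tagged column is caught instead of passing silently
result <- try(check_tags(toy[, c("case_id", "age")], tags, accepted),
              silent = TRUE)
print(inherits(result, "try-error"))
# TRUE
```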
Key Points
- Use the {cleanepi} package to clean and standardize epidemic and outbreak data
- Use linelist to tag, validate, and prepare case data for downstream analysis.