Concept and motivation
In this document, we will outline the design decisions that have steered the development strategies of the {cleanepi} R package, along with the rationale behind each decision and the potential advantages and disadvantages associated with them.
Data cleaning is an important phase for ensuring the efficacy of downstream analysis. The procedures entailed in the cleaning process may differ based on the data type and research objectives. Nonetheless, certain steps can be applied universally across diverse data types, irrespective of their origin.
Design decisions
The {cleanepi} R package is designed to offer functional programming-style data cleansing tasks. To streamline the organization of data cleaning operations, we have categorized them into distinct groups referred to as modules. These modules are based on overarching goals derived from commonly anticipated data cleaning procedures. Each module features a primary function along with additional helper functions tailored to accomplish specific tasks. It’s important to note that, except for few cases where the outcome a helper function can impact on the cleaning task, only the main function of each module will be exported. This deliberate choice empowers users to execute individual cleaning tasks as needed, enhancing flexibility and usability.
At the core of {cleanepi}, the pivotal function
clean_data()
serves as a wrapper encapsulating all the
modules, as illustrated in the figure above. This function is intended
to be the primary entry point for users seeking to cleanse their data.
It performs the cleaning operations as requested by the user through the
set of parameters that need to be explicitly defined. Furthermore,
multiple cleaning operations can be performed sequentially using the
“pipe” operator (%>%
). In addition, this package also
has two surrogate functions:
-
scan_data()
: This function enables users to assess the data types present in each column of their dataset. -
print_report()
: By utilizing this function, users can visualize the report generated from each applied cleaning task, facilitating transparency and understanding of the data cleaning process.
Scope
{cleanepi} is an R package crafted to clean, curate, and standardize tabular datasets, with a particular focus on epidemiological data. In the architecture of {cleanepi}, the data cleaning operations are categorized into modules, each provides a specific data cleaning task. The modules in the current version of {cleanepi} encompass the:
- standardization of column names,
- detection and removal of duplicates,
- replacement of missing values with
NA
, - standardization of subject IDs,
- standardization of date values,
- replacement of existing values with predefined ones (dictionary based substitutions),
- conversion of values when necessary,
- verification of the sequence order of date-events, and
- transformation of selected columns.
By compartmentalizing these operations into modules, {cleanepi} offers users a systematic and adaptable framework to address diverse data cleaning needs, especially within the realm of epidemiological datasets.
Input
The primary functions of the modules, as well as the core function
clean_data()
, accept input in the form of a
data.frame
or linelist
. This offers
flexibility for users regarding where they want to position {cleanepi}
within the R package ecosystem for epidemic analysis pipelines, either
to clean data before or after converting it to a
linelist
.
In addition to the target dataset, the core function
clean_data()
accepts a list
of operations to
be executed on the dataset. It subsequently invokes the primary
functions specified for each module.
Output
Both the primary functions of the modules and the core function
clean_data()
return an object of type
data.frame
or linelist
, depending on the type
of the input dataset. The report generated from all cleaning tasks is
attached to this object as an attribute, which can be accessed using the
attr()
function in base R.
Modules in {cleanepi}
In this section, we provide a detailed description of the way that every module is built.
1. Standardization of column names
This module is designed to standardize the style and format of column names within the target dataset, offering users the flexibility to specify a subset of:
focal columns to preserve in their original format, and
columns to be renamed i.e. given a new name chosen by the user.
Main function:
standardize_column_names()
-
Input:
- A
data.frame
orlinelist
object. - Optionally, a
vector
of focal column names and a vector of column names to be renamed in the form ofnew_name = "old_name"
. If not provided, all columns will undergo standardization.
- A
-
Output:
- The input object with standardized column names.
-
Report:
- A two-column table displaying the initial and current column names for each updated column in the original dataset.
-
Mode:
- Explicit
By incorporating the standardize_column_names()
function, {cleanepi} streamlines the process of ensuring consistency and
clarity in column naming conventions, thereby enhancing the overall
organization and readability of the dataset.
2. Removal of empty rows and columns and constant columns
This module aims at eliminating irrelevant and redundant rows and columns, including empty rows and columns as well as constant columns.
-
Main function:
remove_constants()
-
Input: Accepts a
data.frame
orlinelist
object, along with:- A cut-off that determines the percent of missing values beyond which a row or column should be deleted.
- Output: Returns the input object after applying the specified operations.
-
Report:
- A list with one or two or three vector(s) showcasing detected empty rows, empty columns and constant columns.
-
Mode:
- explicit
3. Detection and removal of duplicates
This module is designed to identify and eliminate duplicated rows.
-
Main function:
find_duplicates(), remove_duplicates()
-
Input: Accepts a
data.frame
orlinelist
object, along with optional parameters:- Vector of target columns (default is to consider all columns;
possible to use
linelist_tags
to consider tagged variables only when the input is a linelist object).
- Vector of target columns (default is to consider all columns;
possible to use
- Output: Returns the input object after applying the specified operations.
-
Report:
- A list with one or two table(s) showcasing found duplicates and
removed duplicates (if
remove = TRUE
). - A table detailing the removed duplicates.
- A list with one or two table(s) showcasing found duplicates and
removed duplicates (if
-
Mode:
- explicit
Through the remove_duplicates()
function, users can
streamline their dataset by eliminating redundant rows, thus enhancing
data integrity and analysis efficiency.
4. Replacement of missing values with
NA
This module aims to standardize and unify the representation of missing values within the dataset.
-
Main function:
replace_missing_values()
-
Input: Accepts a
data.frame
orlinelist
object, along with:- A
vector
of column names (if not provided, the operation is performed across all columns) - A string or a vector of strings representing the current missing values (default value is cleanepi::common_na_strings)
- A
-
Output: Returns the input object with all missing
values represented as
NA
. - Report: Generates a three-column table featuring index, column, and value for each missing item in the dataset.
-
Mode:
- explicit
By utilizing the replace_missing_char()
function, users
can ensure consistency in handling missing values across their dataset,
facilitating accurate analysis and interpretation of the data.
5. Standardization of date values
This module is dedicated to convert date values in character columns
into Date
value in ISO8601
format, and
ensuring that all dates fall within the given timeframe.
-
Main function:
standardize_dates()
-
Input: Accepts a
data.frame
orlinelist
object, along with:- A
vector
of targeted date columns (automatically determined if not provided) - Tolerance threshold that defines the % of missing (out of range
values converted to
NA
) values to be allowed in a converted column. When % missing values exceeds or is equal to it, the original values are returned (default value is 40%) - format (default value is NULL)
- timeframe (default value is null)
- A
- Output: Returns the input object with standardized date values in the format of %Y-%m-%d.
-
Report:
- A two-column table listing the columns where date values were standardized.
- A three-column table featuring index, column name, and values that fall outside the specified timeframe.
- A data frame featuring date values that can comply with more than one specified format. Users can update the standardized date values with the correct when it’s appropriated.
-
Mode:
- explicit
By employing the standardize_dates()
function, users can
ensure uniformity and coherence in date formats across their dataset,
while also validating the temporal integrity of the data within the
defined timeframe.
6. Standardization of subject IDs
This module is tailored to verify whether the values in the column uniquely identifying subjects adhere to a consistent format. It also offers a functionality that allow users to correct the inconsistent subject ids.
-
Main function:
check_subject_ids(), correct_subject_ids()
-
Input: Accepts a
data.frame
orlinelist
object, along with:- The name of the ID column
- Strings for prefix, suffix, and numerical range within the IDs
- A logical that determines whether to delete the wrong ids or not.
- Output: Returns the input object with standardized subject IDs.
- Report: Generates a two-column table featuring index and value of each subject ID that deviates from the expected format.
-
Mode:
- explicit
By utilizing the functions in this module, users can ensure uniformity in the format of subject ids, facilitating accurate tracking and analysis of individual subjects within the dataset.
7. Dictionary based substitution
This module facilitates dictionary-based substitution, which involves
replacing existing values with predefined ones. It replaces entries in a
specific columns to certain values, such as substituting 1 with “male”
and 2 with “female” in a gender column. It also interoperates seamlessly
with the get_meta_data()
function from {readepi} R
package.
-
Main function:
clean_using_dictionary()
-
Input: Accepts a
data.frame
orlinelist
object, along with a data dictionary featuring the following column names: options, values, and order. - Output: Returns the input object where the specified options are replaced by their corresponding values.
- Report: Generates a three-column table with index, column, and value for each unexpected value encountered in a targeted column.
-
Mode:
- explicit
By leveraging the clean_using_dictionary()
function,
users can streamline and standardize the values within specific columns
based on predefined mappings, enhancing consistency and accuracy in the
dataset.
Note that the clean_using_dictionary()
function will
return a warning when it detects unexpected values in the target columns
from the data dictionary. The unexpected values can be added to the data
dictionary using the add_to_dictionary()
function.
8. Conversion of values when necessary
This module is designed to convert numbers written in letters to numerical values, ensuring interoperability with the numberize package.
-
Main function:
convert_to_numeric()
-
Input: Accepts a
data.frame
orlinelist
object, along with:- A vector of column names to be converted into numeric, and
- A string that denotes the language used in the text. The current
version supports
English, French and Spanish
.
- Output: Returns the input object with values in the target columns converted into numeric format.
- Report: Generates a three-column table with index, column, and value for each unrecognized value in the dataset (strings that could not be converted into numeric).
-
Mode:
- explicit
By employing the convert_to_numeric()
function, users
can seamlessly transform numeric representations written in letters into
numerical values, ensuring compatibility with the {numberize} package
and promoting accuracy in numerical analysis.
Note that convert_to_numeric()
will issue a warning for
unexpected values and return them in the report.
9. Verification of the sequence of date-events
This module provides functions to verify whether the sequence of date events aligns with expectations. For instance, it can flag rows where the date of admission to the hospital precedes the individual’s date of birth.
-
Main function:
check_date_sequence()
-
Input: Accepts a
data.frame
orlinelist
object, along with:- A vector of date column names to be considered
- Output: Returns the input object with incorrect rows removed if specified.
- Report: Generates a table listing the incorrect rows from the specified columns.
-
Mode:
- explicit
By using the check_date_sequence()
function, users can
systematically validate and ensure the coherence of date sequences
within their dataset, promoting accuracy and reliability in subsequent
analyses.
10. Transformation of selected columns
This module is dedicated to performing various specialized operations related to epidemiological data analytics, and it currently includes the following functions:
-
Main function:
timespan()
-
Input: Accepts a
data.frame
orlinelist
object, along with:- The name of the column of interest
- The reference date value or column
- The time and remainder units (possible values are days, weeks, months, or years)
- The names of the newly generated columns
- Output: Returns the input object with clean data and one or two additional columns with values in the specified time unit.
- Report: None.
-
Mode:
- explicit
By leveraging the timespan()
function, users can
efficiently compute and integrate time span information into their
epidemiological dataset based on user-defined parameters, enhancing the
analytics capabilities of the dataset.
Surrogate functions
scan_data()
: This function is designed to generate a quick summary of the dataset, offering insights into the composition of each column. It calculates the percentage of values belonging to different data types such as character, numeric, missing, logical, and date. This summary can help analysts and data scientists understand the structure and content of the dataset at a glance.print_report()
: This function is used for displaying the report detailing the result of the cleaning operations executed on the dataset. It likely presents information about the data cleaning processes performed, such as handling missing values, correcting data types, removing duplicates, and any other transformations applied to ensure data quality and integrity.
These surrogate functions play crucial roles in the data analysis and cleaning workflow, providing valuable information and documentation about the dataset characteristics and the steps taken to prepare it for analysis or modelling.
Related packages
{janitor}, {matchmaker}, {naniar}, {daiquiri} {epiCleanr} {matchmaker}
Dependencies
The modules and surrogate functions will depend mainly on the following packages:
{numberize}
used for the conversion of number from character to numeric, {dplyr} used in many
way including filtering, column creation, data summary, etc, {magrittr} used
here for its %>%
operator, {linelist} used
to perform some operations on linelist-type input objects, {janitor} used
here for the removal of constant data (empty rows and columns, as well
as constant columns), {matchmaker}
utilized to perform the dictionary-based cleaning, {lubridate} used to
create, handle, and manipulate objects of type Date, {reactable}
mainly used here to customize the data cleaning report, {snakecase} used
in standardizing column names to transform everything into snake-case
except when specified otherwise, {withr} utilized to
handle the creation of temporary files and directory relevant for
print_report()
and, {readr} used to
import data.
The functions will require all other packages that needed in the package development process including:
{checkmate}, {kableExtra}, {bookdown}, {rmarkdown}, {testthat} (>= 3.0.0), {knitr}, {lintr}
Contribute
There are no special requirements to contributing to {cleanepi}, please follow the package contributing guide.