Clean and standardize data — clean_data

Cleans up messy data frames by performing several operations, including: cleaning column names; detecting and removing duplicates, empty records and columns, and constant columns; replacing missing values with NA; converting character columns into dates when they contain a certain number of date values; and detecting subject IDs with incorrect formats.

Usage

clean_data(data, ...)

Arguments

data

The input <data.frame> or <linelist>

...

A <list> of cleaning operations to apply to the input data. The accepted arguments for ... are:

standardize_column_names: A <list> with the arguments needed to standardize the column names. The elements of this list are the input for the standardize_column_names function.
replace_missing_values: A <list> of parameters to be used when replacing the missing values by NA. The elements of the list are the inputs for the replace_missing_values function.
remove_duplicates: A <list> with the arguments that define the columns and other parameters to be considered when looking for duplicates. They are the input values for the remove_duplicates function.
remove_constants: A <list> with the parameters that define whether to remove constant data or not. The values are the input for the remove_constants function.
standardize_dates: A <list> of parameters that will be used to standardize the date values from the input data. They represent the input values for the standardize_dates function.
standardize_subject_ids: A <list> of parameters needed to check whether the subject IDs comply with the expected format. These arguments are the input values for the check_subject_ids function.
to_numeric: A <list> with the parameters needed to convert the specified columns into numeric. When provided, the parameters are the input values for the convert_to_numeric function.
dictionary: A <data.frame> that will be used to substitute the current values in the specified columns with those in the dictionary. It is the main argument for the clean_using_dictionary function.
check_date_sequence: A <list> of arguments to be used when determining whether the sequence of date events is respected across all rows of the input data. The values in this list are the input for the check_date_sequence function.
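
The pattern described above, where each named list supplies the arguments of a matching function, can be pictured with a small base R sketch. This is only an illustration under stated assumptions: toy_clean() and replace_na_strings() are hypothetical stand-ins written for this example, and nothing here is claimed to reflect how clean_data is implemented internally.

```r
# Hypothetical stand-in for one cleaning step; NOT a cleanepi function.
replace_na_strings <- function(data, target_columns = NULL,
                               na_strings = "-99") {
  # NULL means "apply to every column" in this toy version
  cols <- if (is.null(target_columns)) names(data) else target_columns
  for (col in cols) {
    data[[col]][data[[col]] %in% na_strings] <- NA
  }
  data
}

# Hypothetical dispatcher: each named list passed through `...` becomes
# the argument list of the step registered under the same name.
toy_clean <- function(data, ...) {
  operations <- list(...)
  steps <- list(replace_missing_values = replace_na_strings)
  for (step in names(steps)) {
    params <- operations[[step]]
    # A step is skipped when its argument is absent or NULL
    if (!is.null(params)) {
      data <- do.call(steps[[step]], c(list(data = data), params))
    }
  }
  data
}

df <- data.frame(
  id = c("a", "b"),
  age = c("34", "-99"),
  stringsAsFactors = FALSE
)
out <- toy_clean(
  df,
  replace_missing_values = list(target_columns = NULL, na_strings = "-99")
)
```

In this sketch, passing NULL for an operation (or leaving it out) skips that step, which mirrors how the Examples section sets dictionary and check_date_sequence to NULL.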

Value

The input data, cleaned according to the user-specified parameters. A data cleaning report is attached to the output and can be accessed using attr(cleaned_data, "report").
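
As a minimal illustration of how a report stored as an attribute can be retrieved, the base R sketch below attaches a toy report to a toy data frame and reads it back with attr(). The data frame and the report fields (duplicates_removed, notes) are assumptions made up for this example; the actual structure of the report produced by clean_data is not shown here.

```r
# Toy stand-ins: a small data frame and a made-up report stored as a list.
cleaned_data <- data.frame(id = c("PS001P2", "PS002P2"))
attr(cleaned_data, "report") <- list(
  duplicates_removed = 0L,                 # hypothetical field
  notes = "hypothetical report content"    # hypothetical field
)

# Retrieve the attribute by name and inspect its elements
report <- attr(cleaned_data, "report")
names(report)
str(report)
```

Because the report travels with the object as an attribute, it stays available as long as the object is not passed through an operation that drops attributes.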

Examples

# Parameters for column names standardization: rename all column names if
# applicable
standardize_column_names <- list(keep = NULL, rename = NULL)

# parameters to remove constant columns, empty rows and columns: remove rows
# and columns with 100% constant data
remove_constants <- list(cutoff = 1)

# Parameters for substituting missing values ("-99") with NA
replace_missing_values <- list(target_columns = NULL, na_strings = "-99")

# Parameters for duplicates removal across all columns
remove_duplicates <- list(target_columns = NULL)

# Parameters for the conversion of Date columns into "%Y-%m-%d" format
standardize_dates <- list(
  target_columns = NULL,
  error_tolerance = 0.4,
  format = NULL,
  timeframe = as.Date(c("1973-05-29", "2023-05-29")),
  orders = list(
    world_named_months = c("Ybd", "dby"),
    world_digit_months = c("dmy", "Ymd"),
    US_formats = c("Omdy", "YOmd")
  )
)

# Parameters to check whether the subject IDs comply with the expected format
standardize_subject_ids <- list(
  target_columns = "study_id",
  prefix = "PS",
  suffix = "P2",
  range = c(1, 100),
  nchar = 7
)

# convert the 'sex' column into numeric
to_numeric <- list(target_columns = "sex", lang = "en")

# The dictionary-based cleaning will not be performed here
dictionary <- NULL

# no need to check if the sequence of date events is correct
check_date_sequence <- NULL

# perform the data cleaning
cleaned_data <- clean_data(
  data = readRDS(
    system.file("extdata", "test_df.RDS", package = "cleanepi")
  ),
  standardize_column_names = standardize_column_names,
  remove_constants = remove_constants,
  replace_missing_values = replace_missing_values,
  remove_duplicates = remove_duplicates,
  standardize_dates = standardize_dates,
  standardize_subject_ids = standardize_subject_ids,
  to_numeric = to_numeric,
  dictionary = NULL,
  check_date_sequence = NULL
)
#> ℹ Cleaning column names
#> ℹ Replacing missing values with NA
#> ℹ Removing constant columns and empty rows
#> ℹ Removing duplicated rows
#> ℹ No duplicates were found.
#> ℹ Standardizing Date columns
#> ! Detected 8 values that comply with multiple formats and no values that are
#>   outside of the specified time frame.
#> ℹ Enter `print_report(data = dat, "date_standardization")` to access them,
#>   where "dat" is the object used to store the output from this operation.
#> ℹ Checking subject IDs format
#> ! Detected no missing, no duplicated, and 3 incorrect subject IDs.
#> ℹ Enter `print_report(data = dat, "incorrect_subject_id")` to access them,
#>   where "dat" is the object used to store the output from this operation.
#> ℹ You can use the `correct_subject_ids()` function to correct them.
#> ℹ Converting the following  column into numeric: sex