Cleans up messy data frames by performing several operations. These include, among others: cleaning column names; detecting and removing duplicates, empty records and columns, and constant columns; replacing missing values with NA; converting character columns into dates when they contain a certain number of date values; and detecting subject IDs with incorrect formats.

Usage

clean_data(data, params = NULL)

Arguments

data

The input data frame or linelist

params

A list of parameters that define which cleaning operations will be applied to the input data. The default parameters are defined in cleanepi::default_cleanepi_settings(). The possible values are:

  1. standardize_column_names: A list with the arguments needed to standardize the column names. The elements of this list will be the input for the standardize_column_names() function.

  2. replace_missing_values: A list of parameters to be used when replacing the missing values by NA. The elements of the list are the inputs for the replace_missing_values() function.

  3. remove_duplicates: A list with the arguments that define the columns and other parameters to be considered when looking for duplicates. They are the input values for the remove_duplicates() function.

  4. remove_constants: A list with the parameters that define whether to remove constant data or not. The values are the input for the remove_constants() function.

  5. standardize_dates: A list of parameters that will be used to standardize the date values from the input data. They represent the input values for the standardize_dates() function.

  6. standardize_subject_ids: A list of parameters needed to check whether the IDs comply with the expected format. These arguments are the input values for the check_subject_ids() function.

  7. to_numeric: A vector of column names. When provided, the values in these columns will be converted into numeric.

  8. dictionary: A data frame that will be used to substitute the current values in the specified columns with those in the dictionary. It is the main argument for the clean_using_dictionary() function.

  9. check_date_sequence: A list of arguments to be used when determining whether the sequence of date events is respected across all rows of the input data. The values in this list are the input for the check_date_sequence() function.

  10. span: A list with the parameters that define how the time span will be calculated between two columns or between a column and a Date value. These arguments will be used in the span() function.

Value

The cleaned input data according to the user-specified parameters. It is associated with a data cleaning report that can be accessed using attr(cleaned_data, "report").

Examples

# Parameters for column names standardization
standardize_col_names <- list(keep = NULL, rename = NULL)

# Parameters to remove constant columns, empty rows and columns
remove_cte <- list(cutoff = 1)

# Parameters for substituting missing values with NA:
replace_missing_values <- list(target_columns = NULL, na_strings = "-99")

# Parameters for duplicates removal across all columns
remove_duplicates <- list(target_columns = NULL)

# Parameters for dates standardization
standardize_dates <- list(target_columns  = NULL,
                          error_tolerance = 0.4,
                          format          = NULL,
                          timeframe       = as.Date(c("1973-05-29",
                                                      "2023-05-29")),
                          orders          = list(
                            world_named_months = c("Ybd", "dby"),
                            world_digit_months = c("dmy", "Ymd"),
                            US_formats         = c("Omdy", "YOmd")
                          ),
                          modern_excel    = TRUE)

# Parameters for subject IDs standardization
standardize_subject_ids <- list(target_columns = "study_id",
                                prefix         = "PS",
                                suffix         = "P2",
                                range          = c(1, 100),
                                nchar          = 7)

# Parameters for converting columns into numeric
to_numeric <- list(target_columns = "sex", lang = "en")


# dictionary = NULL: the dictionary-based cleaning will not be performed here

cleaned_data <- clean_data(
  data   = readRDS(system.file("extdata", "test_df.RDS",
                               package = "cleanepi")),
  params = list(
    standardize_column_names = standardize_col_names,
    remove_constants         = remove_cte,
    replace_missing_values   = replace_missing_values,
    remove_duplicates        = remove_duplicates,
    standardize_dates        = standardize_dates,
    standardize_subject_ids  = standardize_subject_ids,
    to_numeric               = to_numeric,
    dictionary               = NULL
  )
)
#> 
#> cleaning column names
#> replacing missing values with NA
#> removing the constant columns, empty rows and columns
#> removing duplicated rows
#> standardising date columns
#> checking subject IDs format
#> Warning: Detected incorrect subject ids at lines: 3, 5, 7
#> Use the correct_subject_ids() function to adjust them.
#> converting sex, en into numeric
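
The Value section notes that a data cleaning report is attached to the result. A minimal sketch of retrieving it, assuming `cleaned_data` was produced by the call above; the internal structure of the report is not described on this page, so the sketch only inspects it generically:

```r
# Retrieve the report attached to the cleaned data (see the Value section)
report <- attr(cleaned_data, "report")

# The layout of the report is not documented here, so look at its
# top-level structure rather than assuming specific element names
str(report, max.level = 1)
```

The `report` name is only an illustrative variable. `str()` is used because it works on any R object, whatever form the report takes.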