Skip to contents

When the format of the values in a column and/or the target columns are not defined, we strongly recommend checking a few converted dates manually to make sure that the dates extracted from a character vector or a factor are correct.

Usage

standardize_dates(
  data,
  target_columns = NULL,
  format = NULL,
  timeframe = NULL,
  error_tolerance = 0.5,
  orders = NULL
)

Arguments

data

A data frame or linelist

target_columns

A vector of the target date column names. When the input data is a linelist object, this parameter can be set to linelist_tags if you wish to standardize the date columns across tagged columns only. Default is NULL.

format

A vector of the expected formats in the date values from the date columns. Default is NULL.

timeframe

A vector of 2 values of type date. If provided, date values that do not fall within this timeframe will be set to NA.

error_tolerance

A number between 0 and 1 indicating the proportion of entries which cannot be identified as dates to be tolerated; if this proportion is exceeded, the original vector is returned, and a message is issued; defaults to 0.4 (40 percent).

orders

A list or character vector with the date codes for fine-grained parsing of dates. This allows for parsing of mixed dates. If a list is supplied, that list will be used for successive tries in parsing. When this is not provided (orders = NULL), the function will use the following order defined in the guesser:

list(
  quarter_partial_dates = c("Y", "Ym", "Yq"),
  world_digit_months = c("Yq", "ymd", "ydm", "dmy", "mdy", "myd", "dym",
                         "Ymd", "Ydm", "dmY", "mdY", "mYd", "dYm"),
  world_named_months = c("dby", "dyb", "bdy", "byd", "ybd", "ydb",
                         "dbY", "dYb", "bdY", "bYd", "Ybd", "Ydb"),
  us_format = c("Omdy", "YOmd")
)

Value

The input dataset where the date columns have been standardized. The date values that are out of the specified timeframe will be reported in the report. Similarly, date values that comply with multiple formats will also be featured in the report object.

Details

Check for the presence of date values that could have multiple formats from the $multi_format_dates element of the report.

Converting ambiguous character strings to dates is difficult for many reasons:

  • dates may not use the standard Ymd format

  • within the same variable, dates may follow different formats

  • dates may be mixed with things that are not dates

  • the behavior of as.Date in the presence of non-date is hard to predict, sometimes returning NA, sometimes issuing an error.

This function tries to address all the above issues. Dates with the following format should be automatically detected, irrespective of separators (e.g. "-", " ", "/") and surrounding text:

  • "19 09 2018"

  • "2018 09 19"

  • "19 Sep 2018"

  • "2018 Sep 19"

  • "Sep 19 2018"

How it works

This function relies heavily on lubridate::parse_date_time(), which is an extremely flexible date parser that works well for consistent date formats, but can quickly become unwieldy and may produce spurious results. standardize_dates() will use a list of formats in the orders argument to run parse_date_time() with each format vector separately and take the first correctly parsed date from all the trials.

With the default orders shown above, the dates 03 Jan 2018, 07/03/1982, and 08/20/85 are correctly interpreted as 2018-01-03, 1982-03-07, and 1985-08-20. The examples section will show how you can manipulate the orders to be customized for your situation.

Examples

x <- c("03 Jan 2018", "07/03/1982", "08/20/85")
# The below will coerce values where the month is written in letters only
# into Date.
as.Date(lubridate::parse_date_time(x, orders = c("Ybd", "dby")))
#> Warning:  2 failed to parse.
#> [1] "2018-01-03" NA           NA          

# coerce values where the month is written in letters or numbers into Date.
as.Date(lubridate::parse_date_time(x, orders = c("dmy", "Ymd")))
#> Warning:  1 failed to parse.
#> [1] "2018-01-03" "1982-03-07" NA          

# How to use standardize_dates()
dat <- standardize_dates(
  data = readRDS(
    system.file("extdata", "test_df.RDS", package = "cleanepi")
  ),
  target_columns = "date_first_pcr_positive_test",
  format = NULL,
  timeframe = NULL,
  error_tolerance = 0.4,
  orders = list(
    world_named_months = c("Ybd", "dby"),
    world_digit_months = c("dmy", "Ymd"),
    US_format = c("Omdy", "YOmd")
  )
)