
The function checks whether the data contains character columns. For each one found, it reports the proportion of missing, numeric, date, character, and logical values in that column. See the Details section for how it works.

Usage

scan_data(data)

Arguments

data

A data frame or linelist

Value

A data frame if the input data contains columns of type character; otherwise NA, returned invisibly. The returned data frame has one row per character column and six columns: the column name and the proportions of missing, numeric, date, character, and logical values.

Details

How does it work? The character columns are identified first. When there is no character column, the function reports this in a message. For every character column, the function counts:

  1. the number of missing values (NA).

  2. the number of numeric values. Valid dates are then searched for among the numeric values using the lubridate::as_date() and date_guess() functions. If any are found, a warning is triggered to alert on the presence of ambiguous values (numeric values that could also be dates). NOTE: A date is considered valid in this case if it falls between 50 years before today's date and today's date.

  3. the number of date values among the non-numeric values, detected using the date_guess() function. The date count is the sum of the dates identified from the numeric and the non-numeric values. Because of this overlap between numeric and date, the proportions in a row of the scanning result might sum to more than 1.

  4. the number of logical values. The remaining values are counted as character.
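
The counting steps above can be illustrated with a minimal base R sketch. This is not the cleanepi implementation: `sketch_scan_column()` is a hypothetical helper, it recognises only ISO-formatted (`YYYY-MM-DD`) text dates instead of calling `date_guess()`, and it assumes numeric values are read as days since 1970-01-01 when testing them against the 50-year window.

```r
# Hypothetical helper (not part of cleanepi): proportions of each value type
# in a single character vector, following the four steps described above.
sketch_scan_column <- function(x, today = Sys.Date()) {
  stopifnot(is.character(x))

  # 1. missing values
  is_missing <- is.na(x)

  # 2. numeric values: anything that parses as a number
  num <- suppressWarnings(as.numeric(x))
  is_numeric <- !is_missing & !is.na(num)

  # Ambiguous values: numbers that, read as days since 1970-01-01 (an
  # assumption of this sketch), fall between 50 years ago and today.
  window_start <- seq(today, by = "-50 years", length.out = 2)[2]
  as_days <- as.Date(num, origin = "1970-01-01")
  is_ambiguous <- is_numeric & !is.na(as_days) &
    as_days >= window_start & as_days <= today

  # 3. dates among the non-numeric values (ISO format only in this sketch)
  iso <- as.Date(x, format = "%Y-%m-%d")
  is_text_date <- !is_missing & !is_numeric & !is.na(iso)
  is_date <- is_ambiguous | is_text_date

  # 4. logical values; whatever is left is counted as character
  is_logical <- !is_missing & toupper(x) %in% c("TRUE", "FALSE")
  is_character <- !is_missing & !is_numeric & !is_text_date & !is_logical

  data.frame(
    missing   = mean(is_missing),
    numeric   = mean(is_numeric),
    date      = mean(is_date),
    character = mean(is_character),
    logical   = mean(is_logical)
  )
}

# "19000" is counted both as numeric and as a possible date, so the
# proportions in this row sum to more than 1.
example_result <- sketch_scan_column(
  c("19000", "3", "2023-05-17", "TRUE", "fever", NA),
  today = as.Date("2024-06-01")
)
```

Fixing `today` in the call keeps the example reproducible; with the default `Sys.Date()`, the set of ambiguous numeric values shifts as the 50-year window moves.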

Examples

# scan through a data frame of characters
scan_result <- scan_data(
  data = readRDS(
    system.file("extdata", "messy_data.RDS", package = "cleanepi")
  )
)
#> ! Found 50 numeric values that can also be of type Date in column `case_id`.

# scan through a data frame with two character columns
scan_result <- scan_data(
  data = readRDS(system.file("extdata", "test_linelist.RDS",
                             package = "cleanepi"))
)

# scan through a data frame with no character columns
data(iris)
iris[["fct"]] <- as.factor(sample(c("gray", "orange"), nrow(iris),
                           replace = TRUE))
iris[["lgl"]]  <- sample(c(TRUE, FALSE), nrow(iris), replace = TRUE)
iris[["date"]] <- as.Date(seq.Date(from = as.Date("2024-01-01"),
                                   to = as.Date("2024-08-30"),
                                   length.out = nrow(iris)))
iris[["posit_ct"]] <- as.POSIXct(iris[["date"]])
scan_result        <- scan_data(data = iris)
#>  No character column found from the input data.