Correct misspelled values by using approximate string matching techniques to compare them against the expected values.

Usage

correct_misspelled_values(
  data,
  target_columns,
  wordlist,
  max_distance = 1,
  confirm = rlang::is_interactive(),
  ...
)

Arguments

data

The input <data.frame> or <linelist>

target_columns

A <vector> of the target column names. When the input data is a <linelist> object, this parameter can be set to linelist_tags to apply the fuzzy matching exclusively to the tagged columns.

wordlist

A <character> <vector> of expected words against which the detected misspelled values are matched.

max_distance

An <integer> for the maximum distance allowed when detecting a spelling mistake relative to the wordlist. The distance is the generalized Levenshtein edit distance (see adist()). Default is 1.

confirm

A <logical> that determines whether to show the user a menu of spelling corrections. If TRUE and R is being used interactively, the user has the option to review the proposed spelling corrections. Set this argument to FALSE to turn off the menu() when rlang::is_interactive() returns TRUE but the user should not be prompted, e.g. when running devtools::run_examples().

...

Extra arguments to pass to adist().
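
To make the max_distance and ... arguments more concrete, the following sketch calls utils::adist() directly on a few of the values used in the Examples section. It only illustrates the distance measure; how the function uses these distances internally is not shown here. The object names observed, wordlist and d are chosen for illustration.

```r
# Edit distances between observed values (rows) and expected words (columns).
observed <- c("confermed", "susspected", "did")
wordlist <- c("confirmed", "suspected", "died")

d <- utils::adist(observed, wordlist)
dimnames(d) <- list(observed, wordlist)
d

# Each misspelling is one edit away from its intended word, so it falls
# within the default max_distance of 1.
d["confermed", "confirmed"]  # one substitution
d["susspected", "suspected"] # one deletion
d["did", "died"]             # one insertion

# ignore.case is an adist() argument, so it is the kind of option that
# could be forwarded through `...`.
utils::adist("Died", "died")                     # case difference counts as 1 edit
utils::adist("Died", "died", ignore.case = TRUE) # case ignored, distance 0
```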

Value

The corrected input data according to the user-specified wordlist.

Details

When used interactively (see interactive()), the user is presented with a menu to check that the words detected using approximate string matching are not false positives, and can decide whether to proceed with the spelling corrections. In non-interactive sessions, all misspelled values are replaced by their closest values within the provided vector of expected values.

If multiple words supplied in the wordlist equally match a word in the data and confirm is TRUE, the user is presented with a menu to choose the replacement word. In non-interactive use, multiple equal matches throw a warning.
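
The non-interactive behaviour described above can be approximated with a few lines of base R. This is a minimal sketch under stated assumptions, not the package's implementation: closest_word() is a hypothetical helper, it expects a character vector, and on a tie it warns and leaves the value unchanged (the text above only states that a warning is thrown, so the handling of the value itself is an assumption).

```r
# Hypothetical helper (not part of the package): replace each value with its
# closest wordlist entry when that entry is within `max_distance` edits.
closest_word <- function(values, wordlist, max_distance = 1, ...) {
  d <- utils::adist(values, wordlist, ...)
  out <- values
  for (i in seq_along(values)) {
    # Skip missing values and values that are already spelled correctly.
    if (is.na(values[i]) || values[i] %in% wordlist) next
    best <- min(d[i, ])
    # Too far from every expected word: not treated as a misspelling.
    if (best > max_distance) next
    candidates <- wordlist[d[i, ] == best]
    if (length(candidates) > 1) {
      # Assumption: on a tie, warn and keep the original value.
      warning("'", values[i], "' matches several words equally: ",
              toString(candidates), call. = FALSE)
      next
    }
    out[i] <- candidates
  }
  out
}

closest_word(
  c("died", "recoverd", "did", "recovered"),
  wordlist = c("died", "recovered")
)
#> [1] "died"      "recovered" "died"      "recovered"
```

A value such as "cat" checked against c("bat", "hat") is one edit from both words, so this sketch emits a warning and returns "cat" unchanged.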

Examples

df <- data.frame(
  case_type = c("confirmed", "confermed", "probable", "susspected"),
  outcome = c("died", "recoverd", "did", "recovered")
)
df
#>    case_type   outcome
#> 1  confirmed      died
#> 2  confermed  recoverd
#> 3   probable       did
#> 4 susspected recovered
correct_misspelled_values(
  data = df,
  target_columns = c("case_type", "outcome"),
  wordlist = c("confirmed", "probable", "suspected", "died", "recovered"),
  confirm = FALSE
)
#>   case_type   outcome
#> 1 confirmed      died
#> 2 confirmed recovered
#> 3  probable      died
#> 4 suspected recovered