Correct misspelled values by using approximate string matching techniques to compare them against the expected values.
Source: R/correct_misspelled_values.R
correct_misspelled_values.Rd
Usage
correct_misspelled_values(
  data,
  target_columns,
  wordlist,
  max_distance = 1,
  confirm = rlang::is_interactive(),
  ...
)
Arguments
- data
The input <data.frame> or <linelist>.
- target_columns
A <vector> of the target column names. When the input data is a <linelist> object, this parameter can be set to linelist_tags to apply the fuzzy matching exclusively to the tagged columns.
- wordlist
A <vector> of characters with the words to match to the detected misspelled values.
- max_distance
An <integer> for the maximum distance allowed for detecting a spelling mistake from the wordlist. The distance is the generalized Levenshtein edit distance (see adist()). Default is 1.
- confirm
A <logical> that determines whether to show the user a menu of spelling corrections. If TRUE and R is used interactively, the user has the option to review the proposed spelling corrections. This argument is useful for turning off the menu() when rlang::is_interactive() returns TRUE but prompting the user is not wanted, e.g. devtools::run_examples().
- ...
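To get a feel for the max_distance argument, the edit distances can be inspected directly with utils::adist() from base R. The snippet below only illustrates the distance measure, using values taken from the Examples section; it is not a description of how the function works internally.

```r
# Generalized Levenshtein distance between observed values and a wordlist.
observed <- c("confermed", "susspected", "did")
wordlist <- c("confirmed", "suspected", "died")

# adist() returns a matrix: one row per observed value, one column per word.
distances <- utils::adist(observed, wordlist)
dimnames(distances) <- list(observed, wordlist)
distances

# Each observed value is exactly one edit away from its intended word,
# so the default max_distance = 1 is enough to flag all three.
diag(distances)
```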
Details
When used interactively (see interactive()), the user is presented with a menu to ensure that the words detected using approximate string matching are not false positives, and can decide whether to proceed with the spelling corrections. In non-interactive sessions, all misspelled values are replaced by their closest values within the provided vector of expected values.
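The non-interactive behaviour described above can be approximated with a few lines of base R. This is a minimal sketch under stated assumptions, not the package's implementation: replace_closest() is a hypothetical helper, and it silently takes the first word when several are equally close.

```r
# Hypothetical helper (not part of the package): replace each value by its
# closest wordlist entry when that entry is within max_distance edits.
replace_closest <- function(values, wordlist, max_distance = 1) {
  distances <- utils::adist(values, wordlist)
  vapply(seq_along(values), function(i) {
    d <- distances[i, ]
    # Keep missing values, exact matches, and values too far from any word.
    if (is.na(values[i]) || min(d) == 0 || min(d) > max_distance) {
      return(values[i])
    }
    # which.min() picks the first word in case of a tie.
    wordlist[which.min(d)]
  }, character(1))
}

replace_closest(
  values = c("died", "recoverd", "did", "recovered"),
  wordlist = c("died", "recovered")
)
#> [1] "died"      "recovered" "died"      "recovered"
```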
If multiple words supplied in the wordlist equally match a word in the data and confirm is TRUE, the user is presented with a menu to choose the replacement word. If the function is not used interactively, multiple equal matches throw a warning.
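A tie of this kind can be spotted ahead of time by listing every word that shares the minimum distance. The helper below, closest_matches(), is a hypothetical illustration built on utils::adist(), and the word "mile" with its wordlist is made up for the example.

```r
# Hypothetical helper: return every wordlist entry tied for the closest
# match, or an empty vector when nothing is within max_distance.
closest_matches <- function(value, wordlist, max_distance = 1) {
  d <- drop(utils::adist(value, wordlist))
  if (min(d) > max_distance) {
    return(character(0))
  }
  wordlist[d == min(d)]
}

# "mile" is one edit away from both "male" and "mild".
closest_matches("mile", c("male", "mild", "female"))
#> [1] "male" "mild"
```

More than one returned word signals an ambiguous correction, which is the situation where a menu or a warning is appropriate.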
Examples
df <- data.frame(
  case_type = c("confirmed", "confermed", "probable", "susspected"),
  outcome = c("died", "recoverd", "did", "recovered")
)
df
#>    case_type   outcome
#> 1  confirmed      died
#> 2  confermed  recoverd
#> 3   probable       did
#> 4 susspected recovered
correct_misspelled_values(
  data = df,
  target_columns = c("case_type", "outcome"),
  wordlist = c("confirmed", "probable", "suspected", "died", "recovered"),
  confirm = FALSE
)
#>   case_type   outcome
#> 1 confirmed      died
#> 2 confirmed recovered
#> 3  probable      died
#> 4 suspected recovered