Take line list output from sim_linelist() and replace elements of
the <data.frame> with missing values (e.g. NA), introduce spelling
mistakes and inconsistencies, as well as coerce date types.
Arguments
- linelist
Line list
<data.frame>output fromsim_linelist().- ...
<
dynamic-dots> Named elements to replace default settings. Only if names match exactly are elements replaced, otherwise the function errors.Accepted arguments and their defaults are:
prop_missingA
numericbetween 0 and 1 for the proportion of missing values introduced. Default is0.1(10%).missing_valueA vector with the missing value(s). If multiple values are supplied a missing value is randomly sampled for each cell in the line list. Default is
NA.prop_spelling_mistakesA
numericbetween 0 and 1 used to specify the proportion of spelling mistakes incharactercolumns. Default is0.1(10%).inconsistent_sexA
logicalboolean to specify whether the$sexcolumn uses"m"and"f", or inconsistently uses"m","f","M","F","male","female","Male"or"Female". Default isTRUEso sexes are sampled from the options.sex_as_numericA
logicalboolean used to specify whether the values in the$sexcolumn should be encoded asnumericvalues (0and1). Default isFALSE.sex_as_numericcannot beTRUEifinconsistent_sex = TRUE.numeric_as_charA
logicalboolean used to specify whethernumericcolumns should be coerced tocharacter. Default isTRUE.date_as_charA
logicalboolean used to specify whetherDatecolumns should be coerced tocharacter. Default isTRUE.inconsistent_datesA
logicalboolean used to specify whether the values inDatecolumns are inconsistently formatted (e.g."%Y-%m-%d","%Y/%m/%d","%d-%m-%Y", or"%d %B %Y"). Default isFALSE.prop_int_as_wordA
numericbetween 0 and 1 for the proportion of elements inintegercolumns should that are coerced towords(seeenglish::words()). Default is0.5(50%).prop_duplicate_rowA
numericbetween 0 and 1 for the proportion of rows to duplicate. Default is0.01(1%). Ifprop_duplicate_row> 0 then it is guaranteed that at least one row will be duplicated.inconsistent_idA
logicalboolean used to specify whether the$idcolumn has inconsistent formatting by appending random prefixes and suffixes to a random sample (~10%) of IDs. Default isFALSE, so IDs are numbers (numeric,charactersor words depending onprop_int_as_wordandnumeric_as_char).
Value
A messy line list <data.frame>.
The output <data.frame> has the same structure as the input <data.frame>
from sim_linelist(), with messy entries.
Details
By default messy_linelist():
Introduces 10% of values missing, i.e. converts to
NA.Introduces spelling mistakes in 10% of
charactercolumns.Introduce inconsistency in the reporting of
$sex.Converts
numericcolumns (double&integer) tocharacter.Converts
Datecolumns tocharacter.Converts 50% of
integers to (English) words.Duplicates 1% of rows.
Setting missing_value to something other than NA will likely cause
type coercion in the line list <data.frame> columns, most likely to
character.
When setting sex_as_numeric to TRUE, male is set to 0 and female
to 1. Only one of inconsistent_sex or sex_as_numeric can be TRUE,
otherwise the function will error.
If numeric_as_char = TRUE and sex_as_numeric = TRUE then the sex encoded
as 0 or 1 is converted to character. If prop_spelling_mistake > 0 and
numeric_as_char = TRUE the columns that are converted from numeric to
character do not have spelling mistakes introduced, because they are
numeric characters stored as character strings. If
prop_spelling_mistake > 0 and date_as_char = TRUE spelling mistakes are
not introduced into dates.
The Date columns can be converted into an inconsistent format by
setting inconsistent_dates = TRUE and it requires date_as_char = TRUE,
if the latter is FALSE the function will error.
If numeric_as_char = FALSE and prop_int_as_word > 0 then the integer
columns are converted to character string (either character numbers or
words) but the other numeric columns are not coerced. Spelling mistakes
are not introduced into integers converted to words when
prop_spelling_mistakes > 0 and prop_int_as_word > 0.
Rows are duplicated after other messy modifications so the duplicated row contains identical messy elements.
Examples
linelist <- sim_linelist()
messy_linelist <- messy_linelist(linelist)
# increasing proportion of missingness to 30% with a missing value of -99
messy_linelist <- messy_linelist(
linelist,
prop_missing = 0.3,
missing_value = -99
)
# increasing proportion of spelling mistakes to 50%
messy_linelist <- messy_linelist(linelist, prop_spelling_mistakes = 0.5)
# encode `$sex` as `numeric`
messy_linelist <- messy_linelist(
linelist,
sex_as_numeric = TRUE,
inconsistent_sex = FALSE
)
# inconsistently formatted dates
messy_linelist <- messy_linelist(linelist, inconsistent_dates = TRUE)