Take line list output from sim_linelist() and replace elements of the <data.frame> with missing values (e.g. NA), introduce spelling mistakes and inconsistencies, as well as coerce date types.
Arguments
- linelist
  Line list <data.frame> output from sim_linelist().
- ...
  <dynamic-dots> Named elements to replace default settings. Only if names match exactly are elements replaced, otherwise the function errors. Accepted arguments and their defaults are:
  - prop_missing
    A numeric between 0 and 1 for the proportion of missing values introduced. Default is 0.1 (10%).
  - missing_value
    A single atomic R object used to represent missing values. Default is NA.
  - prop_spelling_mistakes
    A numeric between 0 and 1 used to specify the proportion of spelling mistakes in character columns. Default is 0.1 (10%).
  - inconsistent_sex
    A logical boolean to specify whether the $sex column uses "m" and "f", or inconsistently uses "m", "f", "M", "F", "male", "female", "Male" or "Female". Default is TRUE so sexes are sampled from the options.
  - sex_as_numeric
    A logical boolean used to specify whether the values in the $sex column should be encoded as numeric values (0 and 1). Default is FALSE. sex_as_numeric cannot be TRUE if inconsistent_sex = TRUE.
  - numeric_as_char
    A logical boolean used to specify whether numeric columns should be coerced to character. Default is TRUE.
  - date_as_char
    A logical boolean used to specify whether Date columns should be coerced to character. Default is TRUE.
  - inconsistent_dates
    A logical boolean used to specify whether the values in Date columns are inconsistently formatted (e.g. "%Y-%m-%d", "%Y/%m/%d", "%d-%m-%Y", or "%d %B %Y"). Default is FALSE.
  - prop_int_as_word
    A numeric between 0 and 1 for the proportion of elements in integer columns that are coerced to words (see english::words()). Default is 0.5 (50%).
  - prop_duplicate_row
    A numeric between 0 and 1 for the proportion of rows to duplicate. Default is 0.01 (1%). If prop_duplicate_row > 0 then it is guaranteed that at least one row will be duplicated.
  - inconsistent_id
    A logical boolean used to specify whether the $id column has inconsistent formatting by appending random prefixes and suffixes to a random sample (~10%) of IDs. Default is FALSE, so IDs are numbers (numeric, character or words depending on prop_int_as_word and numeric_as_char).
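The exact-name matching described for ... can be pictured with a small base R sketch. This is not the package's implementation (which uses dynamic dots); override_defaults() and the shortened defaults list are hypothetical names used only for illustration.

```r
# Hypothetical subset of the defaults listed above.
defaults <- list(
  prop_missing = 0.1,
  missing_value = NA,
  prop_spelling_mistakes = 0.1
)

# Hypothetical helper: replace defaults with user-supplied named values,
# erroring when a name does not exactly match a known setting.
override_defaults <- function(...) {
  user <- list(...)
  nms <- names(user)
  if (length(user) > 0 && (is.null(nms) || !all(nms %in% names(defaults)))) {
    stop("Names of elements in `...` must exactly match the accepted arguments.",
         call. = FALSE)
  }
  modifyList(defaults, user)
}

settings <- override_defaults(prop_missing = 0.3)
settings$prop_missing

# A misspelled name is rejected rather than silently ignored.
result <- tryCatch(
  override_defaults(prop_mising = 0.3),
  error = function(e) conditionMessage(e)
)
result
```

Erroring on unknown names, rather than ignoring them, means a typo in a setting name cannot silently leave the default in place.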
Details
By default messy_linelist():
- Introduces 10% missing values, i.e. converts values to NA.
- Introduces spelling mistakes in 10% of character columns.
- Introduces inconsistency in the reporting of $sex.
- Converts numeric columns (double & integer) to character.
- Converts Date columns to character.
- Converts 50% of integers to (English) words.
- Duplicates 1% of rows.
Setting missing_value to something other than NA will likely cause type coercion in the line list <data.frame> columns, most likely to character.
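The coercion mentioned above follows from base R's vector rules and can be seen without the package. The sketch below uses a made-up age vector (an assumption for illustration, not a column produced by sim_linelist()).

```r
# A made-up numeric column.
age <- c(23, 45, 67)

# NA keeps the column type.
with_na <- age
with_na[2] <- NA

# A numeric placeholder such as -99 also keeps the column numeric.
with_num <- age
with_num[2] <- -99

# A character placeholder coerces the whole column to character.
with_chr <- age
with_chr[2] <- "unknown"

class(with_na)
class(with_num)
class(with_chr)
```

A numeric placeholder avoids coercion in numeric columns, but the same value placed in a character column is stored as a string.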
When setting sex_as_numeric to TRUE, male is set to 0 and female to 1. Only one of inconsistent_sex or sex_as_numeric can be TRUE, otherwise the function will error.
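As a rough base R illustration of the encoding and of the mutual exclusivity check, consider the sketch below. The sex vector and check_sex_options() are hypothetical; the package's internal checks may be written differently.

```r
# A made-up $sex column using the consistent "m"/"f" coding.
sex <- c("m", "f", "f", "m")

# Encoding described above: male is 0, female is 1.
sex_numeric <- ifelse(sex == "m", 0, 1)
sex_numeric

# Hypothetical guard mirroring the documented rule that both options
# cannot be TRUE at the same time.
check_sex_options <- function(inconsistent_sex, sex_as_numeric) {
  if (inconsistent_sex && sex_as_numeric) {
    stop("Only one of `inconsistent_sex` or `sex_as_numeric` can be TRUE.",
         call. = FALSE)
  }
  invisible(TRUE)
}

check_sex_options(inconsistent_sex = FALSE, sex_as_numeric = TRUE)
```

Because inconsistent_sex defaults to TRUE, requesting the numeric encoding means setting inconsistent_sex = FALSE explicitly.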
If numeric_as_char = TRUE and sex_as_numeric = TRUE then the sex encoded as 0 or 1 is converted to character. If prop_spelling_mistakes > 0 and numeric_as_char = TRUE the columns that are converted from numeric to character do not have spelling mistakes introduced, because they are numbers stored as character strings. If prop_spelling_mistakes > 0 and date_as_char = TRUE spelling mistakes are not introduced into dates.
The Date columns can be converted into an inconsistent format by setting inconsistent_dates = TRUE. This requires date_as_char = TRUE; if the latter is FALSE the function will error.
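To see why inconsistent formats need character storage, here is a minimal base R sketch that formats each date with a randomly chosen format string. The dates are made up and the sampling scheme is an assumption, not necessarily how the package picks formats. Note that "%B" prints the month name in the current locale.

```r
set.seed(1)

# Made-up dates standing in for a Date column.
dates <- as.Date(c("2023-01-05", "2023-02-14", "2023-03-21"))

# The example formats listed for inconsistent_dates.
formats <- c("%Y-%m-%d", "%Y/%m/%d", "%d-%m-%Y", "%d %B %Y")

# Pick one format per date and render each date as a string.
chosen <- sample(formats, length(dates), replace = TRUE)
messy_dates <- vapply(
  seq_along(dates),
  function(i) format(dates[i], chosen[i]),
  character(1)
)

messy_dates
class(messy_dates)
```

A Date vector has a single underlying representation, so mixed formats can only exist once the values are strings, which is why date_as_char = TRUE is required.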
If numeric_as_char = FALSE and prop_int_as_word > 0 then the integer columns are converted to character strings (either character numbers or words) but the other numeric columns are not coerced. Spelling mistakes are not introduced into integers converted to words when prop_spelling_mistakes > 0 and prop_int_as_word > 0.
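The reason an integer column ends up as character even when numeric_as_char = FALSE can be sketched in base R. The package refers to english::words() for the conversion; the small lookup vector below is a stand-in so the example has no dependencies, and the way elements are selected is an assumption.

```r
set.seed(1)

# A made-up integer column.
ages <- c(3L, 7L, 1L, 9L)

# Stand-in lookup for single digits (the package points to english::words()).
digit_words <- c("zero", "one", "two", "three", "four",
                 "five", "six", "seven", "eight", "nine")

# Convert a proportion of elements to words; the rest become digit strings.
prop_int_as_word <- 0.5
n_words <- round(prop_int_as_word * length(ages))
idx <- sample(seq_along(ages), size = n_words)

messy_ages <- as.character(ages)
messy_ages[idx] <- digit_words[ages[idx] + 1L]

messy_ages
class(messy_ages)
```

A vector holds a single type, so once any element is a word the remaining integers must be stored as character numbers.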
Rows are duplicated after other messy modifications so the duplicated row contains identical messy elements.
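A base R sketch of duplicating rows after the other modifications is given below. The data frame is made up, and using max(1, ...) to guarantee at least one duplicate and appending the copies at the end are assumptions for illustration; the package may compute and place duplicates differently.

```r
set.seed(1)

# A made-up, already messy line list.
messy <- data.frame(
  id = c("1", "two", "3", "four", "5"),
  sex = c("m", "F", "male", "f", "Female")
)

# With a small proportion, rounding alone would give zero rows,
# so at least one row is duplicated whenever the proportion is above zero.
prop_duplicate_row <- 0.01
n_dup <- if (prop_duplicate_row > 0) {
  max(1, round(prop_duplicate_row * nrow(messy)))
} else {
  0
}

# Copy the sampled rows as they are, messy elements included.
dup_idx <- sample(seq_len(nrow(messy)), size = n_dup)
with_dups <- rbind(messy, messy[dup_idx, , drop = FALSE])

nrow(with_dups)
anyDuplicated(with_dups) > 0
```

Because the copies are taken after the values have been modified, exact-match deduplication is enough to detect them.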
Examples
linelist <- sim_linelist()
messy_linelist <- messy_linelist(linelist)
# increasing proportion of missingness to 30% with a missing value of -99
messy_linelist <- messy_linelist(
  linelist,
  prop_missing = 0.3,
  missing_value = -99
)
# increasing proportion of spelling mistakes to 50%
messy_linelist <- messy_linelist(linelist, prop_spelling_mistakes = 0.5)
# encode `$sex` as `numeric`
messy_linelist <- messy_linelist(
  linelist,
  sex_as_numeric = TRUE,
  inconsistent_sex = FALSE
)
# inconsistently formatted dates
messy_linelist <- messy_linelist(linelist, inconsistent_dates = TRUE)