When removing duplicates, users can specify a set of columns to consider with the target_columns argument.
Value
The input data <data.frame> or <linelist> without the duplicated rows identified from all columns or from the specified columns.
Details
Caveat: In many epidemiological datasets, multiple rows may share the same value in one or more columns without being true duplicates. For example, several individuals might have the same symptom onset date and admission date. Be cautious when using this function, especially when applying it to a single target column, to avoid incorrect identification or removal of valid entries.
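The caveat above can be illustrated with a minimal sketch in base R. The `toy` data frame and its values are hypothetical and are not part of the package's test data. The sketch uses base R's `duplicated()` only to show the effect of the column choice; it does not claim to reproduce how `remove_duplicates()` detects duplicates internally.

```r
# Hypothetical toy line list (an assumption for illustration only):
# three distinct patients; p1 and p2 share an onset date, and the
# p3 row is entered twice.
toy <- data.frame(
  id       = c("p1", "p2", "p3", "p3"),
  dt_onset = as.Date(c("2015-06-01", "2015-06-01", "2015-06-02", "2015-06-02")),
  sex      = c("F", "M", "M", "M")
)

# Checking a single column: p2 is flagged although it is a different person.
dup_single <- duplicated(toy["dt_onset"])
dup_single
#> [1] FALSE  TRUE FALSE  TRUE

# Checking all columns: only the genuinely repeated p3 row is flagged.
dup_all <- duplicated(toy)
dup_all
#> [1] FALSE FALSE FALSE  TRUE
```

With the single-column check, removing the flagged rows would discard patient p2, a valid entry. Checking all columns, or a set of columns that together identify a record, avoids this.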
Examples
data <- readRDS(
  system.file("extdata", "test_linelist.RDS", package = "cleanepi")
)
no_dups <- remove_duplicates(
  data = data,
  target_columns = "linelist_tags"
)
#> ! Found 57 duplicated rows in the dataset.
#> ℹ Use `print_report(dat, "found_duplicates")` to access them, where "dat" is
#> the object used to store the output from this operation.
# print the removed duplicates
print_report(no_dups, "removed_duplicates")
#> # A tibble: 34 × 5
#> # Groups: dt_onset, dt_report, sex, outcome [23]
#> row_id dt_onset dt_report sex outcome
#> <int> <date> <date> <fct> <fct>
#> 1 33 2015-05-21 2015-06-03 M Alive
#> 2 62 2015-05-30 2015-06-06 M Alive
#> 3 24 2015-05-31 2015-06-02 M Dead
#> 4 105 2015-05-31 2015-06-09 M Alive
#> 5 31 2015-06-01 2015-06-03 M Alive
#> 6 60 2015-06-01 2015-06-06 F Alive
#> 7 73 2015-06-01 2015-06-07 F Alive
#> 8 78 2015-06-01 2015-06-07 F Alive
#> 9 82 2015-06-01 2015-06-07 F Alive
#> 10 85 2015-06-01 2015-06-07 F Alive
#> # ℹ 24 more rows
# print the detected duplicates
print_report(no_dups, "found_duplicates")
#> $duplicated_rows
#> # A tibble: 57 × 6
#> # Groups: dt_onset, dt_report, sex, outcome [23]
#> row_id group_id dt_onset dt_report sex outcome
#> <int> <int> <date> <date> <fct> <fct>
#> 1 26 1 2015-05-21 2015-06-03 M Alive
#> 2 33 1 2015-05-21 2015-06-03 M Alive
#> 3 55 2 2015-05-30 2015-06-06 M Alive
#> 4 62 2 2015-05-30 2015-06-06 M Alive
#> 5 23 3 2015-05-31 2015-06-02 M Dead
#> 6 24 3 2015-05-31 2015-06-02 M Dead
#> 7 99 4 2015-05-31 2015-06-09 M Alive
#> 8 105 4 2015-05-31 2015-06-09 M Alive
#> 9 27 5 2015-06-01 2015-06-03 M Alive
#> 10 31 5 2015-06-01 2015-06-03 M Alive
#> # ℹ 47 more rows
#>
#> $duplicates_checked_from
#> [1] "dt_onset" "dt_report" "sex" "outcome"
#>