Remove duplicates — remove_duplicates • cleanepi

When removing duplicates, users can specify a set of columns to consider with the target_columns argument.

Usage

remove_duplicates(data, target_columns = NULL)

Arguments

data: The input <data.frame> or <linelist>.
target_columns: A <vector> of column names to use when looking for duplicates. When the input data is a <linelist> object, this argument can be set to "linelist_tags" to look for duplicates on the tagged columns only. Default is NULL, in which case all columns are used.

Value

The input <data.frame> or <linelist> with the duplicated rows removed. Duplicates are identified across all columns or across the specified target columns only.

Details

Caveat: In many epidemiological datasets, multiple rows may share the same value in one or more columns without being true duplicates. For example, several individuals might have the same symptom onset date and admission date. Be cautious when using this function, especially when applying it to a single target column, to avoid incorrectly identifying or removing valid entries.
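
The caveat can be illustrated with a minimal sketch. It uses base R's duplicated() as a stand-in for the duplicate check, so it does not reproduce the internals of remove_duplicates(). The cases data frame and its case_id column are hypothetical toy data, not part of the package.

```r
# Hypothetical toy line list: four distinct patients, no true duplicates.
cases <- data.frame(
  case_id  = c("A1", "A2", "A3", "A4"),
  dt_onset = as.Date(c("2015-06-01", "2015-06-01", "2015-06-01", "2015-06-02")),
  sex      = c("F", "F", "M", "F"),
  stringsAsFactors = FALSE
)

# Base R stand-in for duplicate detection (an assumption for illustration,
# not cleanepi's implementation): duplicated() flags every row that repeats
# an earlier row on the columns it is given.
dup_on_onset <- duplicated(cases["dt_onset"])            # one column
dup_on_two   <- duplicated(cases[c("dt_onset", "sex")])  # two columns
dup_on_all   <- duplicated(cases)                        # every column

sum(dup_on_onset)  # 2 rows flagged although all patients are distinct
sum(dup_on_two)    # 1 row flagged
sum(dup_on_all)    # 0 rows flagged
```

The narrower the set of columns, the more valid rows are flagged. Including an identifying column in the comparison, where one exists, reduces the risk of dropping distinct cases.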

Examples

data <- readRDS(
  system.file("extdata", "test_linelist.RDS", package = "cleanepi")
)
no_dups <- remove_duplicates(
  data = data,
  target_columns = "linelist_tags"
)
#> ! Found 57 duplicated rows in the dataset.
#> ℹ Use `print_report(dat, "found_duplicates")` to access them, where "dat" is
#>   the object used to store the output from this operation.

# print the removed duplicates
print_report(no_dups, "removed_duplicates")
#> # A tibble: 34 × 5
#> # Groups:   dt_onset, dt_report, sex, outcome [23]
#>    row_id dt_onset   dt_report  sex   outcome
#>     <int> <date>     <date>     <fct> <fct>  
#>  1     33 2015-05-21 2015-06-03 M     Alive  
#>  2     62 2015-05-30 2015-06-06 M     Alive  
#>  3     24 2015-05-31 2015-06-02 M     Dead   
#>  4    105 2015-05-31 2015-06-09 M     Alive  
#>  5     31 2015-06-01 2015-06-03 M     Alive  
#>  6     60 2015-06-01 2015-06-06 F     Alive  
#>  7     73 2015-06-01 2015-06-07 F     Alive  
#>  8     78 2015-06-01 2015-06-07 F     Alive  
#>  9     82 2015-06-01 2015-06-07 F     Alive  
#> 10     85 2015-06-01 2015-06-07 F     Alive  
#> # ℹ 24 more rows

# print the detected duplicates
print_report(no_dups, "found_duplicates")
#> $duplicated_rows
#> # A tibble: 57 × 6
#> # Groups:   dt_onset, dt_report, sex, outcome [23]
#>    row_id group_id dt_onset   dt_report  sex   outcome
#>     <int>    <int> <date>     <date>     <fct> <fct>  
#>  1     26        1 2015-05-21 2015-06-03 M     Alive  
#>  2     33        1 2015-05-21 2015-06-03 M     Alive  
#>  3     55        2 2015-05-30 2015-06-06 M     Alive  
#>  4     62        2 2015-05-30 2015-06-06 M     Alive  
#>  5     23        3 2015-05-31 2015-06-02 M     Dead   
#>  6     24        3 2015-05-31 2015-06-02 M     Dead   
#>  7     99        4 2015-05-31 2015-06-09 M     Alive  
#>  8    105        4 2015-05-31 2015-06-09 M     Alive  
#>  9     27        5 2015-06-01 2015-06-03 M     Alive  
#> 10     31        5 2015-06-01 2015-06-03 M     Alive  
#> # ℹ 47 more rows
#> 
#> $duplicates_checked_from
#> [1] "dt_onset"  "dt_report" "sex"       "outcome"  
#>