Identify and return duplicated rows in a data frame or linelist.
Source: R/find_and_remove_duplicates.R
Arguments
- data
 The input <data.frame> or <linelist>.
- target_columns
 A <vector> of column names or indices to consider when looking for duplicates. When the input data is a <linelist> object, this parameter can be set to linelist_tags to look for duplicates across the tagged columns only. Its default value is NULL, which considers duplicates across all columns.
Value
The input <data.frame> or <linelist>, with a new element added to its
report object. This element is a data frame containing the columns used
to identify duplicates, augmented with the following two additional
columns:
- row_id
 The indices of the duplicated rows in the input data. Within each group of duplicates, users can pick from these indices the rows they consider redundant.
- group_id
 A unique identifier associated with each group of duplicates.
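To make the meaning of row_id and group_id concrete, here is a minimal base R sketch on a toy data frame. It is not cleanepi's implementation: the names toy, key, is_dup and dup_report are hypothetical, and groups are numbered here by order of first appearance, which may differ from the numbering the package uses.

```r
# Toy data: rows 1 and 3 match on both columns, as do rows 2 and 5.
toy <- data.frame(
  sex     = c("M", "F", "M", "F", "F"),
  outcome = c("Alive", "Dead", "Alive", "Alive", "Dead"),
  stringsAsFactors = FALSE
)
target_columns <- c("sex", "outcome")

# Build one key per row from the target columns.
key <- do.call(paste, c(toy[target_columns], sep = "\r"))

# A row is part of a duplicate group when its key occurs more than once.
is_dup <- key %in% key[duplicated(key)]

dup_report <- data.frame(
  row_id   = which(is_dup),                             # position in the input
  group_id = match(key[is_dup], unique(key[is_dup])),   # one id per distinct key
  toy[is_dup, target_columns, drop = FALSE],
  row.names = NULL
)

# Sort so that the members of each group sit next to each other.
dup_report <- dup_report[order(dup_report$group_id, dup_report$row_id), ]
print(dup_report)
```

Row 4 is absent from dup_report because its combination of sex and outcome occurs only once.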
Examples
data <- readRDS(
  system.file("extdata", "test_linelist.RDS", package = "cleanepi")
)
# find duplicates across the following columns: "dt_onset", "dt_report",
# "sex", and "outcome"
dups <- find_duplicates(
  data = data,
  target_columns = c("dt_onset", "dt_report", "sex", "outcome")
)
#> ! Found 57 duplicated rows in the dataset.
#> ℹ Use `print_report(dat, "found_duplicates")` to access them, where "dat" is
#>   the object used to store the output from this operation.
# print the detected duplicates
print_report(dups, "found_duplicates")
#> $duplicated_rows
#> # A tibble: 57 × 6
#> # Groups:   dt_onset, dt_report, sex, outcome [23]
#>    row_id group_id dt_onset   dt_report  sex   outcome
#>     <int>    <int> <date>     <date>     <fct> <fct>  
#>  1     26        1 2015-05-21 2015-06-03 M     Alive  
#>  2     33        1 2015-05-21 2015-06-03 M     Alive  
#>  3     55        2 2015-05-30 2015-06-06 M     Alive  
#>  4     62        2 2015-05-30 2015-06-06 M     Alive  
#>  5     23        3 2015-05-31 2015-06-02 M     Dead   
#>  6     24        3 2015-05-31 2015-06-02 M     Dead   
#>  7     99        4 2015-05-31 2015-06-09 M     Alive  
#>  8    105        4 2015-05-31 2015-06-09 M     Alive  
#>  9     27        5 2015-06-01 2015-06-03 M     Alive  
#> 10     31        5 2015-06-01 2015-06-03 M     Alive  
#> # ℹ 47 more rows
#> 
#> $duplicates_checked_from
#> [1] "dt_onset"  "dt_report" "sex"       "outcome"  
#> 
# access duplicated rows only
print_report(dups, "found_duplicates")$duplicated_rows
#> # A tibble: 57 × 6
#> # Groups:   dt_onset, dt_report, sex, outcome [23]
#>    row_id group_id dt_onset   dt_report  sex   outcome
#>     <int>    <int> <date>     <date>     <fct> <fct>  
#>  1     26        1 2015-05-21 2015-06-03 M     Alive  
#>  2     33        1 2015-05-21 2015-06-03 M     Alive  
#>  3     55        2 2015-05-30 2015-06-06 M     Alive  
#>  4     62        2 2015-05-30 2015-06-06 M     Alive  
#>  5     23        3 2015-05-31 2015-06-02 M     Dead   
#>  6     24        3 2015-05-31 2015-06-02 M     Dead   
#>  7     99        4 2015-05-31 2015-06-09 M     Alive  
#>  8    105        4 2015-05-31 2015-06-09 M     Alive  
#>  9     27        5 2015-06-01 2015-06-03 M     Alive  
#> 10     31        5 2015-06-01 2015-06-03 M     Alive  
#> # ℹ 47 more rows
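Once the duplicated rows are available, the row_id and group_id columns can be used to decide which rows to drop. The sketch below is one possible approach, assuming the first row of each group is the one to keep. dup_rows is a hypothetical stand-in holding only the first four rows of the output printed above, and toy_data is a placeholder for the input data, so the block runs without the package.

```r
# Hypothetical stand-in for the `duplicated_rows` element of the report,
# filled with the first four rows of the printed output.
dup_rows <- data.frame(
  row_id   = c(26L, 33L, 55L, 62L),
  group_id = c(1L, 1L, 2L, 2L)
)

# Placeholder for the input data: only the row positions matter here.
toy_data <- data.frame(id = seq_len(70))

# Treat every row after the first in each group as redundant.
redundant <- dup_rows$row_id[duplicated(dup_rows$group_id)]

# Drop the redundant rows by position.
deduplicated <- toy_data[-redundant, , drop = FALSE]
nrow(deduplicated)
```

Keeping the first row of each group is only a convention. Any other rule, such as keeping the most complete record, can be applied by choosing different values for redundant.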