Identify and return duplicated rows in a data frame or linelist.

Usage

find_duplicates(data, target_columns = NULL)

Arguments

data

The input <data.frame> or <linelist>.

target_columns

A <vector> of column names or indices to consider when looking for duplicates. When the input data is a <linelist> object, this parameter can be set to linelist_tags to look for duplicates across the tagged columns only. The default is NULL, which considers duplicates across all columns.

Value

A <data.frame> or <linelist> of all duplicated rows with the following two additional columns:

row_id

The indices of the duplicated rows in the input data. Users can use these indices to choose which rows they consider redundant in each group of duplicates.

group_id

A unique identifier associated with each group of duplicates.
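
To make the meaning of row_id and group_id concrete, here is a minimal base R sketch of how such columns could be derived. It is an illustration only, not the package's implementation; the `toy` data frame, its values, and the helper variables `key` and `is_dup` are all hypothetical.

```r
# Toy data frame (hypothetical values, not taken from the cleanepi test data)
toy <- data.frame(
  id      = 1:6,
  sex     = c("M", "F", "M", "F", "M", "F"),
  outcome = c("Alive", "Dead", "Alive", "Alive", "Alive", "Dead"),
  stringsAsFactors = FALSE
)
target_columns <- c("sex", "outcome")

# Build one key per row from the target columns, then flag every row whose
# key occurs more than once (first occurrences included)
key    <- do.call(paste, c(toy[target_columns], sep = "\r"))
is_dup <- key %in% key[duplicated(key)]

dups <- toy[is_dup, , drop = FALSE]
dups$row_id   <- which(is_dup)                                # position in the input
dups$group_id <- match(key[is_dup], unique(key[is_dup]))      # one id per group
dups <- dups[order(dups$group_id, dups$row_id), ]
```

In this sketch, rows 1, 3, and 5 share the same sex and outcome and form group 1, rows 2 and 6 form group 2, and row 4 is not reported because its combination of values is unique.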

Examples

data <- readRDS(
  system.file("extdata", "test_linelist.RDS", package = "cleanepi")
)

# find duplicates across the following columns: "dt_onset", "dt_report",
# "sex", and "outcome"
dups <- find_duplicates(
  data = data,
  target_columns = c("dt_onset", "dt_report", "sex", "outcome")
)
#> ! Found 57 duplicated rows in the dataset.
#>  Use `print_report(dat, "found_duplicates")` to access them, where "dat" is
#>   the object used to store the output from this operation.

# print the detected duplicates
print_report(dups, "found_duplicates")
#> $duplicated_rows
#> # A tibble: 57 × 6
#> # Groups:   dt_onset, dt_report, sex, outcome [23]
#>    row_id group_id dt_onset   dt_report  sex   outcome
#>     <int>    <int> <date>     <date>     <fct> <fct>  
#>  1     26        1 2015-05-21 2015-06-03 M     Alive  
#>  2     33        1 2015-05-21 2015-06-03 M     Alive  
#>  3     55        2 2015-05-30 2015-06-06 M     Alive  
#>  4     62        2 2015-05-30 2015-06-06 M     Alive  
#>  5     23        3 2015-05-31 2015-06-02 M     Dead   
#>  6     24        3 2015-05-31 2015-06-02 M     Dead   
#>  7     99        4 2015-05-31 2015-06-09 M     Alive  
#>  8    105        4 2015-05-31 2015-06-09 M     Alive  
#>  9     27        5 2015-06-01 2015-06-03 M     Alive  
#> 10     31        5 2015-06-01 2015-06-03 M     Alive  
#> # ℹ 47 more rows
#> 
#> $duplicates_checked_from
#> [1] "dt_onset"  "dt_report" "sex"       "outcome"  
#>
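
Once the duplicates are listed, the row_id values can be used to drop the rows judged redundant. The following base R sketch assumes one possible rule, keeping the first row of each group; the `toy` input and the `dups` table are invented for illustration and only mimic the row_id and group_id columns described above.

```r
# Hypothetical input and duplicates table (values invented for illustration)
toy <- data.frame(
  case = c("a", "b", "c", "d", "e"),
  sex  = c("M", "F", "M", "F", "M"),
  stringsAsFactors = FALSE
)
dups <- data.frame(
  row_id   = c(1L, 3L, 5L, 2L, 4L),
  group_id = c(1L, 1L, 1L, 2L, 2L)
)

# Assumed rule: keep the first row listed in each group, drop the others
redundant <- dups$row_id[duplicated(dups$group_id)]

# Logical indexing stays correct even when `redundant` is empty
deduplicated <- toy[!seq_len(nrow(toy)) %in% redundant, , drop = FALSE]
```

Any other rule (for example, keeping the most recently reported row) can be applied the same way, by selecting a different set of row_id values per group before filtering.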