Identify and return duplicated rows in a data frame or linelist.
Source: R/find_and_remove_duplicates.R
find_duplicates.Rd
Arguments
- data
The input <data.frame> or <linelist>.
- target_columns
A <vector> of column names or indices to consider when looking for duplicates. When the input data is a <linelist> object, this parameter can be set to linelist_tags so that duplicates are looked for across the tagged columns only. Its default value is NULL, which considers duplicates across all columns.
Value
A <data.frame> or <linelist> of all duplicated rows, with the following 2 additional columns:
- row_id
The indices of the duplicated rows in the input data. Users can choose from these indices which row they consider redundant in each group of duplicates.
- group_id
A unique identifier associated with each group of duplicates.
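To make the meaning of row_id and group_id concrete, the following is a minimal base R sketch that builds the same two columns for a toy data frame. It is an illustration only, not the package's implementation: the data frame `df`, the choice of `target_columns`, and the rule of numbering groups by first appearance are all assumptions made for this example, and find_duplicates() may order or number groups differently.

```r
# Toy data (hypothetical): rows 1 and 3 match, and rows 2 and 5 match,
# on the columns (sex, outcome).
df <- data.frame(
  id      = 1:5,
  sex     = c("M", "F", "M", "F", "F"),
  outcome = c("Alive", "Dead", "Alive", "Alive", "Dead"),
  stringsAsFactors = FALSE
)
target_columns <- c("sex", "outcome")  # assumed choice for illustration

# Flag every row that belongs to a duplicated combination, not only the
# second and later occurrences.
key <- df[, target_columns, drop = FALSE]
is_dup <- duplicated(key) | duplicated(key, fromLast = TRUE)

# Keep the duplicated rows and record their positions in the input.
dups <- df[is_dup, , drop = FALSE]
dups$row_id <- which(is_dup)

# Number the groups in order of first appearance (an assumption of this sketch).
combo <- do.call(paste, c(dups[, target_columns, drop = FALSE], sep = "\r"))
dups$group_id <- match(combo, unique(combo))

# Put row_id and group_id first and sort by group, as in the documented output.
dups <- dups[order(dups$group_id, dups$row_id),
             c("row_id", "group_id", names(df))]
rownames(dups) <- NULL
print(dups)
```

Each group_id collects the rows that share the same values in the target columns, and row_id points back to their positions in the input, which is what lets a user decide which member of each group to keep.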
Examples
data <- readRDS(
  system.file("extdata", "test_linelist.RDS", package = "cleanepi")
)
# find duplicates across the following columns: "dt_onset", "dt_report",
# "sex", and "outcome"
dups <- find_duplicates(
  data = data,
  target_columns = c("dt_onset", "dt_report", "sex", "outcome")
)
#> ! Found 57 duplicated rows in the dataset.
#> ℹ Use `print_report(dat, "found_duplicates")` to access them, where "dat" is
#> the object used to store the output from this operation.
# print the detected duplicates
print_report(dups, "found_duplicates")
#> $duplicated_rows
#> # A tibble: 57 × 6
#> # Groups: dt_onset, dt_report, sex, outcome [23]
#> row_id group_id dt_onset dt_report sex outcome
#> <int> <int> <date> <date> <fct> <fct>
#> 1 26 1 2015-05-21 2015-06-03 M Alive
#> 2 33 1 2015-05-21 2015-06-03 M Alive
#> 3 55 2 2015-05-30 2015-06-06 M Alive
#> 4 62 2 2015-05-30 2015-06-06 M Alive
#> 5 23 3 2015-05-31 2015-06-02 M Dead
#> 6 24 3 2015-05-31 2015-06-02 M Dead
#> 7 99 4 2015-05-31 2015-06-09 M Alive
#> 8 105 4 2015-05-31 2015-06-09 M Alive
#> 9 27 5 2015-06-01 2015-06-03 M Alive
#> 10 31 5 2015-06-01 2015-06-03 M Alive
#> # ℹ 47 more rows
#>
#> $duplicates_checked_from
#> [1] "dt_onset" "dt_report" "sex" "outcome"
#>
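Once the duplicates are known, the row_id values can be used to drop the rows considered redundant. The sketch below shows one possible rule in base R: keep the first row of each group and remove the rest. The objects `df` and `dups` are hand-written stand-ins shaped like the input and the documented return value, not output from the package, and "keep the first row" is an assumption; any other choice within a group works the same way.

```r
# Hypothetical input data.
df <- data.frame(
  sex     = c("M", "F", "M", "F", "F"),
  outcome = c("Alive", "Dead", "Alive", "Alive", "Dead")
)

# Hand-written table with the two documented columns: the position of each
# duplicated row in df and the group it belongs to.
dups <- data.frame(
  row_id   = c(1L, 3L, 2L, 5L),
  group_id = c(1L, 1L, 2L, 2L)
)

# Within each group, treat every row after the first as redundant.
redundant <- dups$row_id[duplicated(dups$group_id)]

# Drop those rows from the original data by position. The length check
# avoids df[-integer(0), ], which would return zero rows.
deduplicated <- if (length(redundant) > 0) {
  df[-redundant, , drop = FALSE]
} else {
  df
}
print(deduplicated)
```

Selecting by position keeps the decision explicit: the rows that remain are exactly those whose indices were not chosen as redundant.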