Identify and return duplicated rows in a data frame or linelist.
Source: R/find_and_remove_duplicates.R
Arguments
- data
A data frame or linelist.
- target_columns
A vector of column names or indices to consider when looking for duplicates. When the input data is a linelist object, this parameter can be set to linelist_tags so that duplicates are identified across the tagged columns only. Its default value is NULL, which considers duplicates across all columns.
Value
A data frame or linelist of all duplicated rows, with the following two additional columns:
- row_id
The indices of the duplicated rows in the input data. Users can use these indices to choose which row in each group of duplicates they consider redundant.
- group_id
A unique identifier associated with each group of duplicates.
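To illustrate what row_id and group_id convey, here is a minimal base R sketch on a toy data frame. It is not the cleanepi implementation; the data frame `df`, the `key` helper vector, and the "keep the first row of each group" rule are assumptions made purely for illustration.

```r
# Toy data (hypothetical): rows 1 and 3 match on both columns, as do rows 2 and 5.
df <- data.frame(
  sex = c("m", "f", "m", "f", "f"),
  outcome = c("died", "recovered", "died", "died", "recovered"),
  stringsAsFactors = FALSE
)
target_columns <- c("sex", "outcome")

# Build one key per row from the target columns.
key <- do.call(paste, c(df[target_columns], sep = "|"))

# A row belongs to a group of duplicates if its key occurs more than once.
is_dup <- key %in% key[duplicated(key)]

# Mimic the documented output: duplicated rows plus row_id and group_id.
dups <- df[is_dup, , drop = FALSE]
dups$row_id <- which(is_dup)
dups$group_id <- match(key[is_dup], unique(key[is_dup]))

# One possible choice of redundant rows: every row after the first in each group.
redundant <- dups$row_id[duplicated(dups$group_id)]
deduplicated <- df[setdiff(seq_len(nrow(df)), redundant), , drop = FALSE]
```

Here row_id points back to positions in the original data, so the choice of which rows to drop stays with the user; keeping the first row of each group is only one option.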
Examples
dups <- find_duplicates(
  data = readRDS(
    system.file("extdata", "test_linelist.RDS", package = "cleanepi")
  ),
  target_columns = c("dt_onset", "dt_report", "sex", "outcome")
)
#> ! Found 57 duplicated rows in the dataset.
#> ℹ Use `attr(dat, "report")[["duplicated_rows"]]` to access them, where "dat" is
#> the object used to store the output from this operation.