Identify and return duplicated rows in a data frame or linelist.
Source: R/find_and_remove_duplicates.R
find_duplicates.Rd
Arguments
- data
A data frame or linelist.
- target_columns
A vector of column names or indices to consider when looking for duplicates. When the input data is a linelist object, this parameter can be set to linelist_tags to look for duplicates across the tagged columns only. Its default value is NULL, which considers duplicates across all columns.
Value
A data frame or linelist of all duplicated rows, with the following two additional columns:
- row_id: the indices of the duplicated rows in the input data. From these indices, users can choose which rows they consider redundant in each group of duplicates.
- group_id: a unique identifier associated with each group of duplicates.
Examples
dups <- find_duplicates(
data = readRDS(
system.file("extdata", "test_linelist.RDS", package = "cleanepi")
),
target_columns = c("dt_onset", "dt_report", "sex", "outcome")
)
#> Found 57 duplicated rows in the dataset. Please consult the report for more details.
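
The following base R sketch illustrates what the row_id and group_id columns described under Value represent, and how they might be used to drop redundant rows. It does not call cleanepi and is not the package's implementation: the toy data frame and the helper names (toy, key, is_dup, redundant, keep) are hypothetical, and the way the identifiers are numbered here is an assumption made for illustration.

```r
# Toy data (hypothetical): rows 1, 3 and 5 share the same (sex, outcome)
# pair, and so do rows 2 and 4. Row 6 is unique.
toy <- data.frame(
  id      = 1:6,
  sex     = c("F", "M", "F", "M", "F", "M"),
  outcome = c("died", "recovered", "died", "recovered", "died", "died"),
  stringsAsFactors = FALSE
)
target_columns <- c("sex", "outcome")

# Build one key per row from the target columns only.
key <- do.call(paste, c(toy[target_columns], sep = "\r"))

# A row belongs to a group of duplicates if its key occurs more than once.
is_dup <- duplicated(key) | duplicated(key, fromLast = TRUE)

# Collect the duplicated rows and attach the two extra columns.
dups <- toy[is_dup, , drop = FALSE]
dups$row_id   <- which(is_dup)                                # index in the input
dups$group_id <- match(key[is_dup], unique(key[is_dup]))      # one id per group
dups <- dups[order(dups$group_id, dups$row_id), ]

# One possible choice: keep the first row of each group and treat the
# remaining rows of that group as redundant.
redundant    <- dups$row_id[duplicated(dups$group_id)]
keep         <- setdiff(seq_len(nrow(toy)), redundant)
deduplicated <- toy[keep, , drop = FALSE]
```

Keeping the first row of each group is only one option. Because row_id points back to the input data, any other rule (for example, keeping the most complete record in each group) can be applied in the same way.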