Instructor Notes

Lead a short discussion that relates the diagnosed characteristics to the cleaning operations they require.

You can use these terms to diagnose characteristics:

Codification: sex and age entries recorded as numbers, letters, or words. Dates also appear in different arrangements (“dd/mm/yyyy” or “yyyy/mm/dd”) and formats. Less visible, but equally relevant, are the column names.
Missing values: how should we interpret an entry like “” in status or “-99” in another column? Do we have a data dictionary from the data collection process?
Inconsistencies: for example, a date of sample that precedes the date of onset.
Non-plausible values: for example, outlier observations with dates outside the expected timeframe.
Duplicates: are all observations unique?
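
If the discussion needs a concrete prompt, the diagnostic terms above can be sketched in base R. This is a minimal illustration, not part of the lesson data: the `cases` data frame, its column names, the placeholder codes (“” and “-99”), and the expected timeframe are all assumptions invented for the example.

```r
# Hypothetical toy linelist, invented for illustration only
cases <- data.frame(
  id = c(1, 2, 2, 3),
  sex = c("m", "Female", "Female", "1"),
  status = c("confirmed", "", "", "suspected"),
  age = c("25", "-99", "-99", "forty"),
  date_onset = as.Date(c("2013-01-10", "2023-01-12", "2023-01-12", "2023-01-15")),
  date_sample = as.Date(c("2023-01-11", "2023-01-14", "2023-01-14", "2023-01-13")),
  stringsAsFactors = FALSE
)

# Codification: list the distinct codes used in a categorical column
sex_codes <- unique(cases$sex)

# Missing: count entries that look like placeholder codes (assumed: "" and "-99")
n_placeholders <- sum(cases$status == "") + sum(cases$age == "-99")

# Inconsistencies: rows where the sample date precedes the onset date
sample_before_onset <- which(cases$date_sample < cases$date_onset)

# Non-plausible values: onset dates outside an assumed expected timeframe
timeframe <- as.Date(c("2023-01-01", "2023-01-31"))
outside_timeframe <- which(
  cases$date_onset < timeframe[1] | cases$date_onset > timeframe[2]
)

# Duplicates: rows that repeat an earlier row exactly
duplicate_rows <- which(duplicated(cases))
```

Each object maps to one term in the list, so learners can be asked which row or value triggered each flag before any cleaning is attempted.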

You can use these terms to relate to cleaning operations:

Standardize column name
Standardize categorical variables like sex/gender
Standardize date columns
Convert from character to numeric values
Check the sequence of dated events
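
To make the link between terms and operations tangible, the cleaning operations above can be sketched in base R. This is a rough illustration under stated assumptions, not the lesson's own workflow: the `raw` data frame, its column names, the date formats, and the use of -99 as a missing-value code are all hypothetical.

```r
# Hypothetical raw data, invented for illustration only
raw <- data.frame(
  "Case ID" = c(1, 2, 3),
  "Sex" = c("m", "Female", "F"),
  "Age" = c("25", "31", "-99"),
  "Date Onset" = c("10/01/2023", "12/01/2023", "15/01/2023"),
  "Date Sample" = c("2023/01/11", "2023/01/14", "2023/01/13"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Standardize column names: lower case, underscores instead of spaces
names(raw) <- gsub(" ", "_", tolower(names(raw)), fixed = TRUE)

# Standardize a categorical variable: map entries to "male" / "female"
sex_first <- tolower(substr(raw$sex, 1, 1))
raw$sex <- ifelse(
  sex_first == "m", "male",
  ifelse(sex_first == "f", "female", NA_character_)
)

# Standardize date columns: each column is parsed with its own assumed format
raw$date_onset <- as.Date(raw$date_onset, format = "%d/%m/%Y")
raw$date_sample <- as.Date(raw$date_sample, format = "%Y/%m/%d")

# Convert from character to numeric, treating -99 as a missing-value code (assumption)
raw$age <- as.numeric(raw$age)
raw$age[raw$age %in% -99] <- NA

# Check the sequence of dated events: flag samples taken before onset
out_of_order <- which(raw$date_sample < raw$date_onset)
```

Here `out_of_order` flags the third row, where the sample date precedes the onset date. Flagging rather than deleting leaves the decision about such rows to the analyst.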

Instructor Note

duplicated rows: 3, 4, 5
empty rows: 6
empty cols: 5
constant rows: 6
constant cols: 5

Point out to learners that removing some constant rows or columns can itself create new constant columns or rows.

R

df %>%
  cleanepi::remove_constants()

OUTPUT

! Constant data was removed after 2 iterations.
ℹ Enter `attr(dat, "report")[["constant_data"]]` for more information, where
  "dat" represents the object used to store the output from
  `remove_constants()`.

OUTPUT

# A tibble: 2 × 2
   col1  col2
  <dbl> <dbl>
1     1     1
2     2     3

R

df %>%
  cleanepi::remove_constants() %>%
  cleanepi::remove_constants()

OUTPUT

! Constant data was removed after 2 iterations.
ℹ Enter `attr(dat, "report")[["constant_data"]]` for more information, where
  "dat" represents the object used to store the output from
  `remove_constants()`.

OUTPUT

# A tibble: 2 × 2
   col1  col2
  <dbl> <dbl>
1     1     1
2     2     3
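
To show learners why constant data can take more than one iteration to remove, the following base R sketch repeats a single removal pass until nothing changes. It is a simplified illustration and not the implementation used by `remove_constants()`: the `toy` data frame and the helper names `drop_constants_once()` and `drop_constants()` are hypothetical, and the sketch assumes a missing value counts as a distinct value when judging whether a column is constant.

```r
# Hypothetical toy data, not the lesson's `df`
toy <- data.frame(
  col1 = c(1, 2, NA),
  col2 = c(1, 3, NA),
  col3 = c("a", "a", NA),
  stringsAsFactors = FALSE
)

# One pass: drop empty rows and constant (or empty) columns,
# both judged on the same snapshot of the data
drop_constants_once <- function(dat) {
  empty_row <- apply(is.na(dat), 1, all)
  # A column is constant when it holds a single distinct value (NA counts as a value)
  constant_col <- vapply(dat, function(x) length(unique(x)) <= 1, logical(1))
  dat[!empty_row, !constant_col, drop = FALSE]
}

# Repeat the pass until the dimensions stop changing
drop_constants <- function(dat) {
  iterations <- 0
  repeat {
    cleaned <- drop_constants_once(dat)
    if (identical(dim(cleaned), dim(dat))) break
    dat <- cleaned
    iterations <- iterations + 1
  }
  list(data = dat, iterations = iterations)
}

result <- drop_constants(toy)
result$iterations
result$data
```

In this toy case the first pass removes only the empty third row, because `col3` still holds two distinct values ("a" and a missing value). Once that row is gone, `col3` contains only "a" and becomes constant, so a second pass is needed to drop it.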

Notice that cleanepi contains a set of functions to diagnose the cleaning status (e.g., check_subject_ids() and check_date_sequence() in the chunk above) and another set to perform a cleaning action (the complementary functions from the chunk above).

Validate case data

Instructor Note

If learners do not have an experience to share, we as instructors can share one.

A scenario like this usually happens when the institution doing the analysis is not the same as the institution collecting the data. The latter can make decisions about the data structure that affect downstream processes, impacting the time or the accuracy of the analysis results.
