Instructor Notes


Read case data


Clean case data


Instructor Note

Lead a short discussion to relate the diagnosed characteristics with required cleaning operations.

You can use the following terms to diagnose characteristics:

  • Codification, such as values in columns like ‘gender’ and ‘age’ being coded with a mix of numbers, letters, and words. Also the presence of multiple date formats (“dd/mm/yyyy”, “yyyy/mm/dd”, etc.) in the same column, as in ‘date_onset’. Less visibly, codification also applies to the column names.
  • Missing, how to interpret an entry like “” in the ‘status’ column or “-99” in other circumstances? Do we have a data dictionary from the data collection process?
  • Inconsistencies, like having a date of sample before the date of onset.
  • Non-plausible values, like observations where some date values fall outside the expected timeframe.
  • Duplicates, are all observations unique?
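
If the discussion needs a concrete anchor, the diagnosis terms above can be shown on a small table. The following is a minimal base R sketch, not the lesson's dataset: the data frame `linelist`, its values, and the choice of “” and “-99” as missing-value placeholders are all assumptions made up for illustration.

```r
# Hypothetical toy linelist; all names and values are invented for illustration.
linelist <- data.frame(
  case_id     = c(1, 2, 2, 3, 4),
  gender      = c("m", "Female", "Female", "2", "F"),
  status      = c("confirmed", "", "", "-99", "suspected"),
  date_onset  = c("2014-03-01", "05/03/2014", "05/03/2014",
                  "2014-03-10", "2014-03-12"),
  stringsAsFactors = FALSE
)

# Codification: list the distinct codes used in 'gender'.
gender_codes <- sort(unique(linelist$gender))

# Missing: count entries that look like placeholders for missing values
# (assumption: "" and "-99" are the placeholders in this toy table).
n_missing_like <- sum(linelist$status %in% c("", "-99"))

# Codification of dates: which entries do not follow "yyyy-mm-dd"?
non_iso <- which(!grepl("^[0-9]{4}-[0-9]{2}-[0-9]{2}$", linelist$date_onset))

# Duplicates: which rows repeat an earlier row exactly?
dup_rows <- which(duplicated(linelist))
```

Each object maps to one diagnosis term, so learners can name the characteristic before deciding which cleaning operation addresses it.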

You can use these terms to relate to cleaning operations:

  • Standardize column names
  • Standardize categorical variables like ‘gender’
  • Standardize date columns
  • Convert character values into numeric
  • Check the sequence of dated events
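
The cleaning operations listed above can be sketched in base R to show what each step does conceptually. This is a stand-in under stated assumptions, not the cleanepi implementation: the data frame `raw`, the lookup table `gender_map`, and the helper `parse_mixed_date()` are hypothetical names introduced only for this example.

```r
# Hypothetical raw data; all names and values are invented for illustration.
raw <- data.frame(
  "Case ID"     = c(1, 2, 3),
  "Gender"      = c("m", "Female", "F"),
  "Age"         = c("12", "thirty", "45"),
  "Date Onset"  = c("2014-03-01", "05/03/2014", "2014-03-10"),
  "Date Sample" = c("2014-03-03", "07/03/2014", "2014-03-08"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Standardize column names: lower case, underscores instead of spaces.
names(raw) <- gsub(" ", "_", tolower(names(raw)))

# Standardize a categorical variable with an explicit lookup table.
gender_map <- c(m = "male", male = "male", f = "female", female = "female")
raw$gender <- unname(gender_map[tolower(raw$gender)])

# Standardize dates: try "yyyy-mm-dd" first, then "dd/mm/yyyy".
parse_mixed_date <- function(x) {
  out <- as.Date(x, format = "%Y-%m-%d")
  todo <- is.na(out)
  out[todo] <- as.Date(x[todo], format = "%d/%m/%Y")
  out
}
raw$date_onset  <- parse_mixed_date(raw$date_onset)
raw$date_sample <- parse_mixed_date(raw$date_sample)

# Convert character values into numeric; in base R, number words become NA.
raw$age <- suppressWarnings(as.numeric(raw$age))

# Check the sequence of dated events: onset should not come after sampling.
bad_sequence <- which(raw$date_onset > raw$date_sample)
```

Here `bad_sequence` flags the rows to review rather than correcting them, which mirrors the difference between diagnosing a problem and acting on it.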


Instructor Note

Make sure they start by removing duplicates before removing constant data.

  • indices of duplicated rows: 3, 4, 5
  • indices of empty rows: 4 (from the first iteration); 3 (from the second iteration)
  • empty cols: “col5”
  • constant cols: “col3”, and “col4”

Point out to learners that varying the value of the cutoff argument yields a different set of constant data after removal.

R

df <- df %>% cleanepi::remove_constants(cutoff = 0.5)
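
To make the order of operations and the idea of iterations concrete, here is a minimal base R sketch on a made-up table. It is an illustration under assumptions, not how cleanepi works internally, and it does not model the cutoff argument; the data frame `toy` and every intermediate name are hypothetical.

```r
# Hypothetical toy data, invented for illustration.
toy <- data.frame(
  col1 = c(1, 2, NA, NA),
  col2 = c(1, 3, NA, NA),
  col3 = c("a", "a", NA, NA),
  col4 = c(NA, NA, NA, NA),
  stringsAsFactors = FALSE
)

# Step 1: drop exact duplicate rows (row 4 repeats row 3).
dedup <- toy[!duplicated(toy), , drop = FALSE]

# Step 2: drop rows and columns that are entirely missing.
is_empty_row <- rowSums(!is.na(dedup)) == 0
is_empty_col <- colSums(!is.na(dedup)) == 0
trimmed <- dedup[!is_empty_row, !is_empty_col, drop = FALSE]

# Step 3: drop columns that hold a single distinct value.
# 'col3' only becomes constant once the empty row is gone, which is why
# a second pass can find constant data that the first pass could not.
is_constant <- vapply(trimmed, function(x) length(unique(x)) == 1, logical(1))
cleaned <- trimmed[, !is_constant, drop = FALSE]
```

Asking learners to predict `names(cleaned)` before running the last step is one way to check that they follow why removal can take more than one iteration.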


Instructor Note

Notice that cleanepi contains a set of functions to diagnose the cleaning status (e.g., check_subject_ids() and check_date_sequence() in the chunk above) and another set to perform a cleaning action (the complementary functions from the chunk above).



Validate case data


Instructor Note

If learners do not have an experience to share, we as instructors can share one.

A scenario like this usually happens when the institution doing the analysis is not the same as the institution collecting the data. The latter can make decisions about the data structure that affect downstream processes, impacting the time needed for the analysis or the accuracy of its results.



Aggregate and visualize