Instructor Notes

Read case data


Clean case data


Instructor Note

Moderate a short discussion that relates each diagnosed characteristic to the cleaning operation it requires.

You can use these terms to diagnose characteristics:

  • Codification, like sex and age entries coded with numbers, letters, and words. Also dates in different arrangements (“dd/mm/yyyy” or “yyyy/mm/dd”) and formats. Less visibly, the column names too.
  • Missing, how should we interpret an entry like “” in status or “-99” in another column? Do we have a data dictionary from the data collection process?
  • Inconsistencies, like a date of sample that falls before the date of onset.
  • Non-plausible values, like outlier observations with dates outside the expected timeframe.
  • Duplicates, are all observations unique?
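
To anchor the discussion, each diagnostic term can be paired with a quick check. The following is a minimal base-R sketch on a made-up data frame; the column names (`sex`, `age`, `date_onset`, `date_sample`), the `-99` placeholder, and the expected date window are all assumptions for illustration and are not taken from the lesson data.

```r
# Hypothetical toy linelist; every value is invented for illustration.
cases <- data.frame(
  sex = c("m", "F", "male", "F", "F"),
  age = c("25", "-99", "forty", "31", "31"),
  date_onset = as.Date(c("2014-05-01", "2014-05-03", "2014-05-10",
                         "2014-06-02", "2014-06-02")),
  date_sample = as.Date(c("2014-05-04", "2014-05-02", "2014-05-12",
                          "2014-06-05", "2014-06-05")),
  stringsAsFactors = FALSE
)

# Codification: list the distinct spellings used for one category.
sex_codes <- unique(cases$sex)

# Missing: count entries that use the assumed placeholder "-99".
n_placeholder <- sum(cases$age == "-99")

# Inconsistencies: rows where the sample date precedes the onset date.
bad_sequence <- which(cases$date_sample < cases$date_onset)

# Non-plausible values: onset dates outside an assumed expected window.
window <- as.Date(c("2014-05-01", "2014-05-31"))
out_of_window <- which(cases$date_onset < window[1] |
                         cases$date_onset > window[2])

# Duplicates: rows that repeat an earlier row exactly.
dup_rows <- which(duplicated(cases))
```

Each result is a small vector that learners can inspect directly, which keeps the focus on naming the problem before choosing a cleaning operation.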

You can use these terms to relate to cleaning operations:

  • Standardize column names
  • Standardize categorical variables like sex/gender
  • Standardize date columns
  • Convert from character to numeric values
  • Check the sequence of dated events
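
The cleaning operations listed above can be sketched in base R as follows. This is an illustrative stand-in, not the lesson's own workflow: the data frame, its column names, the lookup table, and the helper `parse_date()` are hypothetical, and the two date formats are assumed to be the only ones present.

```r
# Hypothetical raw data; names and values are invented for illustration.
raw <- data.frame(
  "Date of Onset" = c("01/05/2014", "2014/05/03"),
  "Date of Sample" = c("04/05/2014", "2014/05/02"),
  "Sex" = c("m", "Female"),
  "Age" = c("25", "31"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Standardize column names: lower case, underscores instead of spaces.
names(raw) <- gsub(" +", "_", tolower(names(raw)))

# Standardize a categorical variable with an explicit lookup table.
sex_lookup <- c(m = "male", male = "male", f = "female", female = "female")
raw$sex <- unname(sex_lookup[tolower(raw$sex)])

# Standardize dates: try each assumed format in turn, keep the first match.
# The order of `formats` matters when an entry could match more than one.
parse_date <- function(x, formats = c("%d/%m/%Y", "%Y/%m/%d")) {
  out <- rep(as.Date(NA), length(x))
  for (fmt in formats) {
    todo <- is.na(out)
    out[todo] <- as.Date(x[todo], format = fmt)
  }
  out
}
raw$date_of_onset <- parse_date(raw$date_of_onset)
raw$date_of_sample <- parse_date(raw$date_of_sample)

# Convert from character to numeric values.
raw$age <- as.numeric(raw$age)

# Check the sequence of dated events: onset should not follow sampling.
raw$sequence_ok <- raw$date_of_onset <= raw$date_of_sample
```

Flagging the sequence in a new column, rather than dropping rows, leaves the decision about inconsistent records to a later, explicit step.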


Instructor Note

  • duplicated rows: 3, 4, 5
  • empty rows: 6
  • empty cols: 5
  • constant rows: 6
  • constant cols: 5

Point out to learners that removing constant rows or columns can itself create new constant rows or columns.

R

df %>%
  cleanepi::remove_constants()

OUTPUT

Constant data was removed after 2 iterations. See the report for more details.

OUTPUT

# A tibble: 2 × 2
   col1  col2
  <dbl> <dbl>
1     1     1
2     2     3

R

df %>%
  cleanepi::remove_constants() %>%
  cleanepi::remove_constants()

OUTPUT

Constant data was removed after 2 iterations. See the report for more details.

OUTPUT

# A tibble: 2 × 2
   col1  col2
  <dbl> <dbl>
1     1     1
2     2     3
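
Why removal can take more than one pass is easier to see with a simplified stand-in. The base-R sketch below is not the cleanepi implementation: the helpers `drop_constants_once()` and `drop_constants()`, the toy data, and the definition of "constant" used here (a column with a single distinct value, counting `NA` as a value) are assumptions for illustration.

```r
# Hypothetical toy data: row 4 is empty, and col3 becomes constant
# only once that row has been removed.
toy <- data.frame(
  col1 = c(1, 2, NA, NA),
  col2 = c(1, 3, NA, NA),
  col3 = c("a", "a", "a", NA),
  stringsAsFactors = FALSE
)

# One cleaning pass (simplified stand-in, not the cleanepi code).
drop_constants_once <- function(data) {
  # 1. Drop rows in which every entry is missing.
  data <- data[rowSums(!is.na(data)) > 0, , drop = FALSE]
  # 2. Drop columns in which every entry is missing.
  data <- data[, colSums(!is.na(data)) > 0, drop = FALSE]
  # 3. Drop columns holding a single distinct value (NA counts as a value).
  is_constant <- vapply(data, function(x) length(unique(x)) <= 1, logical(1))
  data[, !is_constant, drop = FALSE]
}

# Repeat passes until a pass no longer changes the dimensions.
drop_constants <- function(data) {
  passes <- 0
  repeat {
    cleaned <- drop_constants_once(data)
    if (identical(dim(cleaned), dim(data))) break
    data <- cleaned
    passes <- passes + 1
  }
  list(data = data, passes = passes)
}

result <- drop_constants(toy)
```

In this toy case the first pass removes the empty row 4 and then the constant `col3`, which leaves row 3 empty; only a second pass removes it. A single pass would therefore stop with an empty row still in the data.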


Validate case data


Instructor Note

If learners do not have an experience to share, we as instructors can share one.

A scenario like this usually happens when the institution doing the analysis is not the same as the institution collecting the data. The latter can make decisions about the data structure that affect downstream processes, increasing the time the analysis takes or reducing the accuracy of its results.



Aggregate and visualize