Design Principles for {datatagr} • datatagr

This vignette outlines the design decisions that have been taken during the development of the datatagr R package, and provides some of the reasoning, and possible pros and cons of each decision.

This document is primarily intended to be read by those interested in understanding the code within the package and for potential package contributors.

Scope

datatagr provides generic labelling and validation tools. In contrast to the original versions of linelist (<=v1.1.4), datatagr functions at the variable level instead of the object level.

The validation tooling is specific to type checking variables and providing feedback on potential data loss or coercion. It does not aim to do complex validations at this time.

Naming conventions

We separate functions as much as is reasonable into their own files under R/. If there are tests available for a file under R/, it follows the convention of test-<filename>.R under tests/testthat/. Not all source code has respective tests.

We try to make function names as descriptive as possible, while keeping them short. This is to make the package easy to use and understand.

Input/Output/Interoperability

Any data frame object can be passed into datatagr. Output from datatagr remains a data frame object, with an additional datatagr class attribute. This means it remains interoperable with all the regular data frame operations one may attempt to do.

datatagr is interoperable with pipes (that is, |> or %>%). This allows for easy chaining of functions. Note that there are no guarantees that label attributes are preserved when piping or wrangling in another way. For example, dplyr drops variable level attributes when using dplyr::mutate().

Design decisions

Generic: The package is designed to be a generic tool for labelling and validating data. This is to ensure that the package can be used in a wide range of contexts and is not limited to a specific use case. Any specific use cases should be implemented in separate packages.
Local: We keep functions as local as possible. This means operations should be as precise as is feasible, to be non-destructive and ensure changes on one variable do not unexpectedly affect another. This helps ensure the package is predictable and easy to use + maintain.
Minimize number of functions: We aim to keep the number of functions in the package to a minimum. This helps usability and maintainability.
Base R: We aim to use base R functions where possible. This is to ensure that the package is lightweight and does not have many dependencies. This is for example why we do not use labelled as the labelling package.

If you feel like we did not uphold one of these design decisions, please let us know 😊

Quirks

Any package development has quirks. We outline quirks we are aware of here:

Currently, emptying labels leads to setting them to "" (empty character strings). Preferably we would end up setting them to NULL in the end.

Dependencies

checkmate - provide assertions for function arguments
lifecycle - help manage function lifecycle
rlang - ... to list parsing
tidyselect - ensure we can use pipes in has_label()

Development journey

The datatagr package is a major refactor of linelist v1.1.4. The refactor was necessary to make the package more generic and to make the codebase more maintainable. The refactor was completed in a series of steps documented in #37.