Compatibility with dplyr

The datatagr philosophy is to prevent you from accidentally losing valuable data, while otherwise staying fully transparent and out of the way of your workflow.

One popular ecosystem for data science workflows is the tidyverse, and we try to ensure decent datatagr compatibility with it. All dplyr verbs are tested in the tests/test-compat-dplyr.R file.

library(datatagr)
#> 
#> Attaching package: 'datatagr'
#> The following object is masked from 'package:base':
#> 
#>     labels
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

x <- make_datatagr(
  cars,
  speed = "Miles per hour",
  dist = "Distance in miles"
)

head(x)
#> 
#> // datatagr object
#>   speed dist
#> 1     4    2
#> 2     4   10
#> 3     7    4
#> 4     7   22
#> 5     8   16
#> 6     9   10
#> 
#> labelled variables:
#>  speed - Miles per hour
#>  dist - Distance in miles

Verbs operating on rows

datatagr does not modify the behaviour of row operations. As such, it is fully compatible out of the box with dplyr verbs operating on rows. The following examples show that datatagr does not produce any errors, warnings or messages, and that its labels are conserved through dplyr operations on rows.

`dplyr::arrange()` ✅

x %>%
  arrange(speed) %>%
  head()
#> 
#> // datatagr object
#>   speed dist
#> 1     4    2
#> 2     4   10
#> 3     7    4
#> 4     7   22
#> 5     8   16
#> 6     9   10
#> 
#> labelled variables:
#>  speed - Miles per hour
#>  dist - Distance in miles

`dplyr::distinct()` ✅

x %>%
  distinct() %>%
  head()
#> 
#> // datatagr object
#>   speed dist
#> 1     4    2
#> 2     4   10
#> 3     7    4
#> 4     7   22
#> 5     8   16
#> 6     9   10
#> 
#> labelled variables:
#>  speed - Miles per hour
#>  dist - Distance in miles

`dplyr::filter()` ✅

x %>%
  filter(speed >= 50) %>%
  head()
#> 
#> // datatagr object
#> [1] speed dist 
#> <0 rows> (or 0-length row.names)
#> 
#> labelled variables:
#>  speed - Miles per hour
#>  dist - Distance in miles

`dplyr::slice()` ✅

x %>%
  slice(5:10)
#> 
#> // datatagr object
#>   speed dist
#> 1     8   16
#> 2     9   10
#> 3    10   18
#> 4    10   26
#> 5    10   34
#> 6    11   17
#> 
#> labelled variables:
#>  speed - Miles per hour
#>  dist - Distance in miles

x %>%
  slice_head(n = 5)
#> 
#> // datatagr object
#>   speed dist
#> 1     4    2
#> 2     4   10
#> 3     7    4
#> 4     7   22
#> 5     8   16
#> 
#> labelled variables:
#>  speed - Miles per hour
#>  dist - Distance in miles

x %>%
  slice_tail(n = 5)
#> 
#> // datatagr object
#>   speed dist
#> 1    24   70
#> 2    24   92
#> 3    24   93
#> 4    24  120
#> 5    25   85
#> 
#> labelled variables:
#>  speed - Miles per hour
#>  dist - Distance in miles

x %>%
  slice_min(speed, n = 3)
#> 
#> // datatagr object
#>   speed dist
#> 1     4    2
#> 2     4   10
#> 3     7    4
#> 4     7   22
#> 
#> labelled variables:
#>  speed - Miles per hour
#>  dist - Distance in miles

x %>%
  slice_max(speed, n = 3)
#> 
#> // datatagr object
#>   speed dist
#> 1    25   85
#> 2    24   70
#> 3    24   92
#> 4    24   93
#> 5    24  120
#> 
#> labelled variables:
#>  speed - Miles per hour
#>  dist - Distance in miles

x %>%
  slice_sample(n = 5)
#> 
#> // datatagr object
#>   speed dist
#> 1    23   54
#> 2    14   80
#> 3    12   14
#> 4    19   46
#> 5    19   68
#> 
#> labelled variables:
#>  speed - Miles per hour
#>  dist - Distance in miles

Verbs operating on columns

During operations on columns, datatagr will:

stay invisible and conserve labels if no labelled column is affected by the operation
trigger lost_labels_action() if labelled columns are affected by the operation
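
As a minimal sketch of the second case, the snippet below switches the label loss behaviour to an error and then drops a labelled column. It assumes that lost_labels_action() accepts "warning" and "error" as its first argument and that "warning" is the default, which is consistent with the warnings shown elsewhere on this page but is not confirmed by it.

```r
library(datatagr)
library(dplyr)

x <- make_datatagr(
  cars,
  speed = "Miles per hour",
  dist = "Distance in miles"
)

# Assumption: "error" is an accepted action and makes label loss fail loudly
lost_labels_action("error")

# Dropping the labelled column `speed` should now raise an error,
# which we capture as a character string instead of stopping the script
res <- tryCatch(
  select(x, dist),
  error = function(e) conditionMessage(e)
)

# Assumption: "warning" is the default, so this restores the usual behaviour
lost_labels_action("warning")

res
```

Escalating to an error can be useful in pipelines where silently continuing without a labelled column would be worse than stopping.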

`dplyr::mutate()` ✓ (partial)

Compatibility with dplyr::mutate() is incomplete: simple renames that do not actually modify the column don’t update the labels. In this scenario, users should use dplyr::rename() instead.

Although dplyr::mutate() is not able to leverage the full power of datatagr labels, datatagr objects behave as expected, in the same way a data.frame would:

# In place modification doesn't lose labels
x %>%
  mutate(speed = speed + 10) %>%
  head()
#> 
#> // datatagr object
#>   speed dist
#> 1    14    2
#> 2    14   10
#> 3    17    4
#> 4    17   22
#> 5    18   16
#> 6    19   10
#> 
#> labelled variables:
#>  speed - Miles per hour
#>  dist - Distance in miles

# New columns don't affect existing labels
x %>%
  mutate(ticket = speed >= 50) %>%
  head()
#> 
#> // datatagr object
#>   speed dist ticket
#> 1     4    2  FALSE
#> 2     4   10  FALSE
#> 3     7    4  FALSE
#> 4     7   22  FALSE
#> 5     8   16  FALSE
#> 6     9   10  FALSE
#> 
#> labelled variables:
#>  speed - Miles per hour
#>  dist - Distance in miles

# .keep = "unused" generates the expected label loss conditions
x %>%
  mutate(edad = speed, .keep = "unused") %>%
  head()
#> Warning: The following labelled variables are lost:
#>  speed - Miles per hour
#> 
#> // datatagr object
#>   dist edad
#> 1    2    4
#> 2   10    4
#> 3    4    7
#> 4   22    7
#> 5   16    8
#> 6   10    9
#> 
#> labelled variables:
#>  dist - Distance in miles
#>  edad - Miles per hour

`dplyr::pull()` ✅

dplyr::pull() returns a vector, which results, as expected, in the loss of the datatagr class. The label attribute of the pulled column stays attached to the vector, as the output below shows:

x %>%
  pull(speed)
#>  [1]  4  4  7  7  8  9 10 10 10 11 11 12 12 12 12 13 13 13 13 14 14 14 14 15 15
#> [26] 15 16 16 17 17 17 18 18 18 18 19 19 19 20 20 20 20 20 22 23 24 24 24 24 25
#> attr(,"label")
#> [1] "Miles per hour"
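
To check this programmatically rather than by reading the printed output, a short sketch along these lines can help. It only relies on base R helpers (inherits() and attr()) applied to the pulled vector.

```r
library(datatagr)
library(dplyr)

x <- make_datatagr(
  cars,
  speed = "Miles per hour",
  dist = "Distance in miles"
)

v <- pull(x, speed)

# The result is a plain vector, not a datatagr object
inherits(v, "datatagr")

# The column-level label attribute is still attached to the vector
attr(v, "label")
```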

`dplyr::relocate()` ✅

x %>%
  relocate(speed, .before = 1) %>%
  head()
#> 
#> // datatagr object
#>   speed dist
#> 1     4    2
#> 2     4   10
#> 3     7    4
#> 4     7   22
#> 5     8   16
#> 6     9   10
#> 
#> labelled variables:
#>  speed - Miles per hour
#>  dist - Distance in miles

`dplyr::rename()` & `dplyr::rename_with()` ✅

dplyr::rename() is fully compatible out of the box with datatagr, meaning that labels are updated at the same time as columns are renamed. This is possible because it uses names<-() under the hood, for which datatagr provides a custom names<-.datatagr() method:

x %>%
  rename(edad = speed) %>%
  head()
#> 
#> // datatagr object
#>   edad dist
#> 1    4    2
#> 2    4   10
#> 3    7    4
#> 4    7   22
#> 5    8   16
#> 6    9   10
#> 
#> labelled variables:
#>  edad - Miles per hour
#>  dist - Distance in miles

x %>%
  rename_with(toupper) %>%
  head()
#> 
#> // datatagr object
#>   SPEED DIST
#> 1     4    2
#> 2     4   10
#> 3     7    4
#> 4     7   22
#> 5     8   16
#> 6     9   10
#> 
#> labelled variables:
#>  SPEED - Miles per hour
#>  DIST - Distance in miles
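
Because the mechanism is the names<-() method rather than anything specific to dplyr, the same behaviour can be sketched in base R. The example below renames a column directly and then inspects the result with labels(), the datatagr function that masks base::labels() when the package is attached.

```r
library(datatagr)

x <- make_datatagr(
  cars,
  speed = "Miles per hour",
  dist = "Distance in miles"
)

# Rename through names<-(), which dispatches to the datatagr method
y <- x
names(y)[names(y) == "speed"] <- "edad"

names(y)

# Inspect the labels carried by the renamed object
labels(y)
```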

`dplyr::select()` ✅

dplyr::select() is fully compatible with datatagr, including when columns are renamed in a select():

# Works fine
x %>%
  select(speed, dist) %>%
  head()
#> 
#> // datatagr object
#>   speed dist
#> 1     4    2
#> 2     4   10
#> 3     7    4
#> 4     7   22
#> 5     8   16
#> 6     9   10
#> 
#> labelled variables:
#>  speed - Miles per hour
#>  dist - Distance in miles

# labels are updated!
x %>%
  select(dist, edad = speed) %>%
  head()
#> 
#> // datatagr object
#>   dist edad
#> 1    2    4
#> 2   10    4
#> 3    4    7
#> 4   22    7
#> 5   16    8
#> 6   10    9
#> 
#> labelled variables:
#>  dist - Distance in miles
#>  edad - Miles per hour

Verbs operating on groups ✘

Groups are not yet supported. Applying any verb operating on groups to a datatagr object will silently convert it back to a data.frame or tibble.
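
Until groups are supported, one possible workaround is to check the class after the grouped operation and label the result again. This is only a sketch: the `mean_dist` column and its label are hypothetical names chosen for illustration, and it assumes make_datatagr() accepts the summarised table once it is converted to a data.frame.

```r
library(datatagr)
library(dplyr)

x <- make_datatagr(
  cars,
  speed = "Miles per hour",
  dist = "Distance in miles"
)

# Grouped summary; the datatagr class is expected to be dropped here
by_speed <- x %>%
  group_by(speed) %>%
  summarise(mean_dist = mean(dist))

inherits(by_speed, "datatagr")

# Re-apply labels explicitly (mean_dist and its label are hypothetical)
relabelled <- make_datatagr(
  as.data.frame(by_speed),
  speed = "Miles per hour",
  mean_dist = "Mean distance"
)

head(relabelled)
```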

Verbs operating on data.frames

`dplyr::bind_rows()` ✅

dim(x)
#> [1] 50  2

dim(bind_rows(x, x))
#> [1] 100   2

`dplyr::bind_cols()` ✘

bind_cols() is currently incompatible with datatagr:

Labels from the second element are lost
Warnings are produced about lost labels, even for labels that are not actually lost

bind_cols(
  suppressWarnings(select(x, speed)),
  suppressWarnings(select(x, dist))
) %>%
  head()
#> Warning: The following labelled variables are lost:
#>  speed - Miles per hour
#> Warning: The following labelled variables are lost:
#>  dist - Distance in miles
#> 
#> // datatagr object
#>   speed dist
#> 1     4    2
#> 2     4   10
#> 3     7    4
#> 4     7   22
#> 5     8   16
#> 6     9   10
#> 
#> labelled variables:
#>  speed - Miles per hour
#>  dist - Distance in miles

Joins ✘

Joins are currently not compatible with datatagr as labels from the second element are silently dropped.

full_join(
  suppressWarnings(select(x, speed, dist)),
  suppressWarnings(select(x, dist, speed))
) %>%
  head()
#> Joining with `by = join_by(speed, dist)`
#> Warning in full_join(suppressWarnings(select(x, speed, dist)), suppressWarnings(select(x, : Detected an unexpected many-to-many relationship between `x` and `y`.
#> ℹ Row 17 of `x` matches multiple rows in `y`.
#> ℹ Row 17 of `y` matches multiple rows in `x`.
#> ℹ If a many-to-many relationship is expected, set `relationship =
#>   "many-to-many"` to silence this warning.
#> 
#> // datatagr object
#>   speed dist
#> 1     4    2
#> 2     4   10
#> 3     7    4
#> 4     7   22
#> 5     8   16
#> 6     9   10
#> 
#> labelled variables:
#>  speed - Miles per hour
#>  dist - Distance in miles
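
If you need labels on the result of a join, a cautious approach is to label the joined table again instead of relying on the join to carry them over. The sketch below assumes a hypothetical lookup table `zones` with a hypothetical `zone` column, and it assumes make_datatagr() can be applied to the joined result after conversion to a data.frame.

```r
library(datatagr)
library(dplyr)

x <- make_datatagr(
  cars,
  speed = "Miles per hour",
  dist = "Distance in miles"
)

# Hypothetical second table with its own label
zones <- make_datatagr(
  data.frame(speed = c(4, 7), zone = c("a", "b")),
  zone = "Zone code"
)

# Join, silencing any label-related warnings raised along the way
joined <- suppressWarnings(left_join(x, zones, by = "speed"))

# Re-apply every label explicitly on the joined result
relabelled <- make_datatagr(
  as.data.frame(joined),
  speed = "Miles per hour",
  dist = "Distance in miles",
  zone = "Zone code"
)

head(relabelled)
```

Labelling everything again after the join keeps the labels explicit and avoids depending on which side of the join they came from.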

Verbs operating on multiple columns

`dplyr::pick()` ✘

pick() makes tidyselect functions work in usually tidyselect-incompatible functions, such as:

x %>%
  dplyr::arrange(dplyr::pick(ends_with("loc"))) %>%
  head()
#> 
#> // datatagr object
#>   speed dist
#> 1     4    2
#> 2     4   10
#> 3     7    4
#> 4     7   22
#> 5     8   16
#> 6     9   10
#> 
#> labelled variables:
#>  speed - Miles per hour
#>  dist - Distance in miles

As such, we could expect it to work with has_label(), datatagr’s custom tidyselect-like function. This is not the case, because pick() currently strips out all attributes, including the datatagr class and all labels. This unclassing is documented in ?pick:

pick() returns a data frame containing the selected columns for the current group.
