Compatibility with dplyr

The safeframe philosophy is to prevent you from accidentally losing valuable data, while otherwise staying fully transparent and not interfering with your workflow.

One popular ecosystem for data science workflows is the tidyverse, and we try to ensure that safeframe is compatible with it. All dplyr verbs are tested in the tests/test-compat-dplyr.R file.

library(safeframe)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

x <- make_safeframe(
  cars,
  mph = "speed",
  distance = "dist"
)

head(x)
#> 
#> // safeframe object
#>   speed dist
#> 1     4    2
#> 2     4   10
#> 3     7    4
#> 4     7   22
#> 5     8   16
#> 6     9   10
#> 
#> tagged variables:
#>  mph - speed
#>  distance - dist

Verbs operating on rows

safeframe does not change the behaviour of row operations. As such, it is fully compatible out-of-the-box with dplyr verbs operating on rows. The following examples show that safeframe does not produce any errors, warnings or messages, and that its tags are conserved through dplyr operations on rows.

`dplyr::arrange()` ✅

x %>%
  arrange(speed) %>%
  head()
#> 
#> // safeframe object
#>   speed dist
#> 1     4    2
#> 2     4   10
#> 3     7    4
#> 4     7   22
#> 5     8   16
#> 6     9   10
#> 
#> tagged variables:
#>  mph - speed
#>  distance - dist

`dplyr::distinct()` ✅

x %>%
  distinct() %>%
  head()
#> 
#> // safeframe object
#>   speed dist
#> 1     4    2
#> 2     4   10
#> 3     7    4
#> 4     7   22
#> 5     8   16
#> 6     9   10
#> 
#> tagged variables:
#>  mph - speed
#>  distance - dist

`dplyr::filter()` ✅

x %>%
  filter(speed >= 5) %>%
  head()
#> 
#> // safeframe object
#>   speed dist
#> 1     7    4
#> 2     7   22
#> 3     8   16
#> 4     9   10
#> 5    10   18
#> 6    10   26
#> 
#> tagged variables:
#>  mph - speed
#>  distance - dist

`dplyr::slice()` ✅

x %>%
  slice(5:10)
#> 
#> // safeframe object
#>   speed dist
#> 1     8   16
#> 2     9   10
#> 3    10   18
#> 4    10   26
#> 5    10   34
#> 6    11   17
#> 
#> tagged variables:
#>  mph - speed
#>  distance - dist

x %>%
  slice_head(n = 5)
#> 
#> // safeframe object
#>   speed dist
#> 1     4    2
#> 2     4   10
#> 3     7    4
#> 4     7   22
#> 5     8   16
#> 
#> tagged variables:
#>  mph - speed
#>  distance - dist

x %>%
  slice_tail(n = 5)
#> 
#> // safeframe object
#>   speed dist
#> 1    24   70
#> 2    24   92
#> 3    24   93
#> 4    24  120
#> 5    25   85
#> 
#> tagged variables:
#>  mph - speed
#>  distance - dist

x %>%
  slice_min(speed, n = 3)
#> 
#> // safeframe object
#>   speed dist
#> 1     4    2
#> 2     4   10
#> 3     7    4
#> 4     7   22
#> 
#> tagged variables:
#>  mph - speed
#>  distance - dist

x %>%
  slice_max(speed, n = 3)
#> 
#> // safeframe object
#>   speed dist
#> 1    25   85
#> 2    24   70
#> 3    24   92
#> 4    24   93
#> 5    24  120
#> 
#> tagged variables:
#>  mph - speed
#>  distance - dist

x %>%
  slice_sample(n = 5)
#> 
#> // safeframe object
#>   speed dist
#> 1    23   54
#> 2    14   80
#> 3    12   14
#> 4    19   46
#> 5    19   68
#> 
#> tagged variables:
#>  mph - speed
#>  distance - dist

Verbs operating on columns

During operations on columns, safeframe will:

stay invisible and conserve tags if no tagged column is affected by the operation
trigger lost_tags_action() if tagged columns are affected by the operation
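
The second behaviour can be tuned. The sketch below assumes that lost_tags_action() accepts the values "warning", "error" and "none", with "warning" as the default; the argument values are an assumption, so check ?lost_tags_action before relying on them. It switches tag loss to an error, drops a tagged column, and then restores the warning behaviour.

```r
library(safeframe)
library(dplyr)

x <- make_safeframe(cars, mph = "speed", distance = "dist")

# Assumption: "error" is a valid action and turns tag loss into an error.
lost_tags_action("error")

# select() drops the tagged column `speed`, so this call is expected to fail;
# tryCatch() captures the error instead of stopping the script.
res <- tryCatch(
  select(x, dist),
  error = function(e) "tag loss raised an error"
)

# Assumption: "warning" is the default action; restore it.
lost_tags_action("warning")
```

Turning tag loss into an error can be useful in pipelines where a missing tagged column should stop execution rather than be noticed later.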

`dplyr::count()` ✅

count() adds a new column, n, and keeps the tags of the columns that remain:

x %>%
  count(speed) %>%
  head()
#> 
#> // safeframe object
#>   speed n
#> 1     4 2
#> 2     7 2
#> 3     8 1
#> 4     9 1
#> 5    10 3
#> 6    11 2
#> 
#> tagged variables:
#>  mph - speed

`dplyr::mutate()` ✓ (partial)

Compatibility with dplyr::mutate() is incomplete: a simple rename that does not actually modify the column does not update the tags. In this scenario, use dplyr::rename() instead.

Although dplyr::mutate() cannot leverage the full power of safeframe tags, safeframe objects otherwise behave as a data.frame would:

# In place modification doesn't lose tags
x %>%
  mutate(speed = speed + 10) %>%
  head()
#> 
#> // safeframe object
#>   speed dist
#> 1    14    2
#> 2    14   10
#> 3    17    4
#> 4    17   22
#> 5    18   16
#> 6    19   10
#> 
#> tagged variables:
#>  mph - speed
#>  distance - dist

# New columns don't affect existing tags
x %>%
  mutate(ticket = speed >= 50) %>%
  head()
#> 
#> // safeframe object
#>   speed dist ticket
#> 1     4    2  FALSE
#> 2     4   10  FALSE
#> 3     7    4  FALSE
#> 4     7   22  FALSE
#> 5     8   16  FALSE
#> 6     9   10  FALSE
#> 
#> tagged variables:
#>  mph - speed
#>  distance - dist

# .keep = "unused" generates the expected tag loss condition
x %>%
  mutate(edad = speed, .keep = "unused") %>%
  head()
#> Warning: The following tagged variables are lost:
#>  speed - mph
#> 
#> // safeframe object
#>   dist edad
#> 1    2    4
#> 2   10    4
#> 3    4    7
#> 4   22    7
#> 5   16    8
#> 6   10    9
#> 
#> tagged variables:
#>  distance - dist

`dplyr::pull()` ✅

dplyr::pull() returns a vector, which keeps the tag (stored in the label attribute) but loses the safeframe class:

x %>%
  pull(speed)
#>  [1]  4  4  7  7  8  9 10 10 10 11 11 12 12 12 12 13 13 13 13 14 14 14 14 15 15
#> [26] 15 16 16 17 17 17 18 18 18 18 19 19 19 20 20 20 20 20 22 23 24 24 24 24 25
#> attr(,"label")
#> [1] "mph"

`dplyr::relocate()` ✅

x %>%
  relocate(speed, .before = 1) %>%
  head()
#> 
#> // safeframe object
#>   speed dist
#> 1     4    2
#> 2     4   10
#> 3     7    4
#> 4     7   22
#> 5     8   16
#> 6     9   10
#> 
#> tagged variables:
#>  mph - speed
#>  distance - dist

`dplyr::rename()` & `dplyr::rename_with()` ✅

dplyr::rename() is fully compatible out-of-the-box with safeframe, meaning that tags will be updated at the same time that columns are renamed.

x %>%
  rename(edad = speed) %>%
  head()
#> 
#> // safeframe object
#>   edad dist
#> 1    4    2
#> 2    4   10
#> 3    7    4
#> 4    7   22
#> 5    8   16
#> 6    9   10
#> 
#> tagged variables:
#>  mph - edad
#>  distance - dist

x %>%
  rename_with(toupper) %>%
  head()
#> 
#> // safeframe object
#>   SPEED DIST
#> 1     4    2
#> 2     4   10
#> 3     7    4
#> 4     7   22
#> 5     8   16
#> 6     9   10
#> 
#> tagged variables:
#>  mph - SPEED
#>  distance - DIST

`dplyr::select()` ✅

dplyr::select() is fully compatible with safeframe, including when columns are renamed in a select():

# Works fine
x %>%
  select(speed, dist) %>%
  head()
#> 
#> // safeframe object
#>   speed dist
#> 1     4    2
#> 2     4   10
#> 3     7    4
#> 4     7   22
#> 5     8   16
#> 6     9   10
#> 
#> tagged variables:
#>  mph - speed
#>  distance - dist

# tags are updated!
x %>%
  select(dist, edad = speed) %>%
  head()
#> 
#> // safeframe object
#>   dist edad
#> 1    2    4
#> 2   10    4
#> 3    4    7
#> 4   22    7
#> 5   16    8
#> 6   10    9
#> 
#> tagged variables:
#>  distance - dist
#>  mph - edad

Verbs operating on groups ✘

Groups are not yet supported. Applying any verb that operates on groups to a safeframe will silently convert it back to a data.frame or tibble.

# Does not retain tags
x %>%
  group_by(speed) %>%
  head()
#> # A tibble: 6 × 2
#> # Groups:   speed [4]
#>   speed  dist
#>   <dbl> <dbl>
#> 1     4     2
#> 2     4    10
#> 3     7     4
#> 4     7    22
#> 5     8    16
#> 6     9    10

Please indicate if you would like to see this supported in a future release by commenting on the GitHub issue about this.
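
Until groups are supported, one possible workaround is to re-apply the tags by hand once the grouped computation is finished. This is only a sketch: it assumes that make_safeframe() accepts the tibble returned by summarise(), and the object names by_speed and y are illustrative.

```r
library(safeframe)
library(dplyr)

x <- make_safeframe(cars, mph = "speed", distance = "dist")

# Grouped summary: as described above, the safeframe class is dropped
# by group_by(), so the result is a plain tibble.
by_speed <- x %>%
  group_by(speed) %>%
  summarise(dist = mean(dist), .groups = "drop")

# Re-tag the ungrouped result manually
# (assumption: make_safeframe() accepts a tibble as input).
y <- make_safeframe(by_speed, mph = "speed", distance = "dist")
```

The tags have to be restated explicitly, so this only makes sense when the tagged columns still exist under known names after the grouped step.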

Verbs operating on data.frames

`dplyr::bind_rows()` ✅

dim(x)
#> [1] 50  2

dim(bind_rows(x, x))
#> [1] 100   2

`dplyr::bind_cols()` ✘

bind_cols() is currently incompatible with safeframe:

tags from the second element are lost
warnings are produced about lost tags, even for tags that are not actually lost

bind_cols(
  suppressWarnings(select(x, speed)),
  suppressWarnings(select(x, dist))
) %>%
  head()
#> Warning: The following tagged variables are lost:
#>  speed - mph
#> Warning: The following tagged variables are lost:
#>  dist - distance
#> 
#> // safeframe object
#>   speed dist
#> 1     4    2
#> 2     4   10
#> 3     7    4
#> 4     7   22
#> 5     8   16
#> 6     9   10
#> 
#> tagged variables:
#>  mph - speed
#>  distance - dist

Joins ✘

Joins are currently not compatible with safeframe as tags from the second element are silently dropped.

full_join(
  suppressWarnings(select(x, speed, dist)),
  suppressWarnings(select(x, dist, speed))
) %>%
  head()
#> Joining with `by = join_by(speed, dist)`
#> Warning in full_join(suppressWarnings(select(x, speed, dist)), suppressWarnings(select(x, : Detected an unexpected many-to-many relationship between `x` and `y`.
#> ℹ Row 17 of `x` matches multiple rows in `y`.
#> ℹ Row 17 of `y` matches multiple rows in `x`.
#> ℹ If a many-to-many relationship is expected, set `relationship =
#>   "many-to-many"` to silence this warning.
#> 
#> // safeframe object
#>   speed dist
#> 1     4    2
#> 2     4   10
#> 3     7    4
#> 4     7   22
#> 5     8   16
#> 6     9   10
#> 
#> tagged variables:
#>  mph - speed
#>  distance - dist

Verbs operating on multiple columns

`dplyr::pick()` ✘

pick() lets you use tidyselect helpers inside functions that do not normally support tidyselect, such as arrange():

x %>%
  dplyr::arrange(dplyr::pick(ends_with("loc"))) %>%
  head()
#> 
#> // safeframe object
#>   speed dist
#> 1     4    2
#> 2     4   10
#> 3     7    4
#> 4     7   22
#> 5     8   16
#> 6     9   10
#> 
#> tagged variables:
#>  mph - speed
#>  distance - dist

As such, we could expect it to work with has_tag(), safeframe's custom tidyselect-like function. This is not the case, because pick() currently strips out all attributes, including the safeframe class and all tags. This unclassing is documented in ?pick:

pick() returns a data frame containing the selected columns for the current group.
