Write a function that accesses data from the dplyr context

Question

Disclaimer: this is a very elemental question. I'll use an example to make it easier, but the question has nothing to do with the example itself.

Suppose you have a dataframe df:

# A tibble: 5 × 4
  index     a     b     c
  <int> <int> <dbl> <dbl>
1     1     0     0     1
2     2     1     0     0
3     3     0     1     0
4     4     0     1     0
5     5     1     0     0

And you want to gather the dummies into a single factor column. Getting inspiration from eatATA::dummiesToFactor(), you could use something like:

dum2fac <- function(data) { factor(names(data)[max.col(data)]) }

df %>% mutate(name = dum2fac(across(a:c)))

# A tibble: 5 × 5
  index     a     b     c name 
  <int> <int> <dbl> <dbl> <fct>
1     1     0     0     1 c    
2     2     1     0     0 a    
3     3     0     1     0 b    
4     4     0     1     0 b    
5     5     1     0     0 a
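For readers unfamiliar with max.col(), here is a minimal sketch of what dum2fac() does internally. The values are typed in by hand to match the printed tibble above, and the names dummies, pos and labels are only illustrative.

```r
# Dummy columns copied by hand from the printed tibble (illustrative only)
dummies <- data.frame(
  a = c(0, 1, 0, 0, 1),
  b = c(0, 0, 1, 1, 0),
  c = c(1, 0, 0, 0, 0)
)

# max.col() returns, for each row, the position of the column holding the largest value
pos <- max.col(dummies)

# Indexing the column names by those positions gives one label per row
labels <- factor(names(dummies)[pos])
```

This relies on each row having exactly one 1. By default max.col() breaks ties at random, so a row with several 1s or with all zeros would get an arbitrary label; passing ties.method = "first" makes that case deterministic.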

Now suppose you want to modify dum2fac() to allow for something like the following:

df %>% mutate(name = dum2fac(a:c))

I tried one specific path, and from that my "more elemental" question appeared. This was what I tried:

dum2fac <- function(expr) {
  data <- select(???, {{expr}})
  factor(names(data)[max.col(data)])
}

Here a:c is passed to expr, and ??? stands for "the dataset that is being used in the dplyr context". Put another way: across(a:c) doesn't refer directly to the dataset df; it just knows that it needs to access it because of the context in which it is used, and I want my function to be able to do the same.

Some concepts I figured could help were the "rlang fake data pronoun" .data, and some higher order functions/objects that are used in across and mutate, like the R6 object DataMask, peek_mask(), and others that probably aren't a good practice to use even if possible.

Note: if you have a better way to rewrite dum2fac(), I'm glad to hear it, so please add it too. But again, that's not exactly what this question is about.

Dummy data:

set.seed(2023)
df <- tibble(index = 1:5,
             a = sample(0:1, 5, TRUE),
             b = (1 - a) * sample(0:1, 5, TRUE),
             c = 1 - a - b)

score 3 · Accepted Answer · answered May 26 '23 at 21:09


You can use across() or (more idiomatically) pick() inside your own function. Note that pick() requires dplyr 1.1.0 or later:

library(dplyr)
set.seed(2023)

df <- tibble(
  index = 1:5,
  a = sample(0:1, 5, TRUE),
  b = (1 - a) * sample(0:1, 5, TRUE),
  c = 1 - a - b
)

dum2fac <- function(expr) {
  data <- pick({{ expr }})
  factor(names(data)[max.col(data)])
}

df %>% mutate(name = dum2fac(a:c))
#> # A tibble: 5 × 5
#>   index     a     b     c name 
#>   <int> <int> <dbl> <dbl> <fct>
#> 1     1     0     0     1 c    
#> 2     2     1     0     0 a    
#> 3     3     0     1     0 b    
#> 4     4     0     1     0 b    
#> 5     5     1     0     0 a

If you want the full data without selections, use pick(everything()).
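As a minimal sketch of that last point, the hypothetical helper rows_in_context() below takes no arguments and reads whatever data the surrounding dplyr verb is working on. The data values are typed in by hand to match the output above rather than drawn with sample(), and the helper name is made up for illustration.

```r
library(dplyr)

# Values copied by hand from the printed output above (illustrative only)
df <- tibble(
  index = 1:5,
  a = c(0, 1, 0, 0, 1),
  b = c(0, 0, 1, 1, 0),
  c = c(1, 0, 0, 0, 0)
)

# Hypothetical helper: counts the rows of the data in the current dplyr context
rows_in_context <- function() {
  nrow(pick(everything()))
}

# Ungrouped: pick() sees the whole data frame
ungrouped <- df %>% mutate(n = rows_in_context())

# Grouped: pick() sees only the rows of the current group
grouped <- df %>%
  group_by(a) %>%
  mutate(n = rows_in_context()) %>%
  ungroup()
```

In a grouped context pick() returns only the current group's rows and leaves out the grouping columns, so the same helper gives per-group results without any extra code. Called outside a dplyr verb, it raises an error because there is no data context to pick from.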


Mikko Marttila


This has the benefit of working correctly within a `group_by` grouping, but it has the liability that it will not work outside of `dplyr` verbs. (Not that that's an easy thing to do well, mind you :-) – r2evans May 26 '23 at 21:28
I could've sworn I tried that haha. Also, I didn't know `pick`; it truly makes it clearer. Very nice, thanks. I'll leave the question open for a little bit to see if anyone has another take, then I'll accept it :) – Ricardo Semião e Castro May 26 '23 at 21:55
