
I want to create a function based on dplyr that performs certain operations on subsets of data. The subsets are defined by values of one or more key columns in the dataset. When only one column is used to identify subsets, my code works fine:

library(dplyr)

set.seed(1)
df <- tibble(
  g1 = c(1, 1, 2, 2, 2),
  g2 = c(1, 2, 1, 2, 1),
  a = sample(5)
)
group_key <- "g1"
aggregate <- function(df, by) {
  df %>% group_by(!!sym(by)) %>% summarize(a = mean(a))
}
aggregate(df, by = group_key)

This works as expected and returns something like this:

# A tibble: 2 x 2
     g1     a
  <dbl> <dbl>
1     1   1.5
2     2   4  

Unfortunately everything breaks down if I change group_key:

group_key <- c("g1", "g2")
aggregate(df, by = group_key)

I get the error "Only strings can be converted to symbols", which I think comes from rlang::sym(). Replacing it with syms() does not work either, since it returns a list of symbols, on which group_by() chokes.

Any suggestions would be appreciated!

M--
kgolyaev

3 Answers


You need to use the unquote-splice operator !!!:

aggregate <- function(df, by) {
  df %>% group_by(!!!syms(by)) %>% summarize(a = mean(a))
}

group_key <- c("g1", "g2")

aggregate(df, by = group_key)
## # A tibble: 4 x 3
## # Groups:   g1 [2]
##      g1    g2     a
##   <dbl> <dbl> <dbl>
## 1     1     1   1  
## 2     1     2   4  
## 3     2     1   2.5
## 4     2     2   5  
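
To see why !!! is needed rather than !!, here is a minimal sketch, assuming dplyr and rlang are installed. The names keys, key_syms, spliced and explicit are illustrative and not part of the question. It shows that syms() returns a list of symbols, and that splicing that list behaves like writing each column as a separate argument to group_by():

```r
library(dplyr)
library(rlang)

set.seed(1)
df <- tibble(
  g1 = c(1, 1, 2, 2, 2),
  g2 = c(1, 2, 1, 2, 1),
  a = sample(5)
)

keys <- c("g1", "g2")

# syms() converts a character vector into a list of symbols, one per name
key_syms <- syms(keys)
str(key_syms)

# !!! splices the list into separate arguments, so the two calls below
# should group by the same columns and produce the same summary
spliced  <- df %>% group_by(!!!key_syms) %>% summarize(a = mean(a))
explicit <- df %>% group_by(g1, g2) %>% summarize(a = mean(a))

identical(group_vars(spliced), group_vars(explicit))
identical(spliced$a, explicit$a)
```

The same function body also covers the single-column case, because syms("g1") is just a list of length one.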
dave-edison

Alternatively, you can use dplyr::group_by_at:

library(dplyr)

agg <- function(df, by) {
  df %>% group_by_at(vars(one_of(by))) %>% summarize(a = mean(a))
}

group_key <- "g1"
group_keys <- c("g1","g2")

agg(df, by = group_key)
#> # A tibble: 2 x 2
#>      g1     a
#>   <dbl> <dbl>
#> 1     1  2.5 
#> 2     2  3.33

agg(df, by = group_keys)
#> # A tibble: 4 x 3
#> # Groups:   g1 [2]
#>      g1    g2     a
#>   <dbl> <dbl> <dbl>
#> 1     1     1   1  
#> 2     1     2   4  
#> 3     2     1   2.5
#> 4     2     2   5
M--
  • you can directly pass the character vector instead of `vars()`: `group_by_at(by)` – Lionel Henry Jul 12 '19 at 06:30
  • @LionelHenry Have you tried this? ```Error in UseMethod("tbl_vars") : no applicable method for 'tbl_vars' applied to an object of class "function"``` – M-- Jul 12 '19 at 13:31
  • Do you need to pipe a data frame into it? `mtcars %>% group_by_at(c("cyl", "am"))` – Lionel Henry Jul 13 '19 at 20:01
  • @LionelHenry well piping aint necessary never. You can always put it like this `group_by(df, col)`. Read about `magrittr::%>%` for more info. – M-- Jul 13 '19 at 20:22
  • Let me rephrase: Do you need to pass the data frame to `group_by()`? – Lionel Henry Jul 14 '19 at 22:14
  • @LionelHenry what? Yes. You want to group the **dataframe** by a column. Of course you need to pass it. – M-- Jul 14 '19 at 22:15
  • Can you please simplify your answer, `vars()` and `one_of()` basically cancel each other. Your answer is great but can be simplified like this: `df %>% group_by_at(by) %>% ...` – Lionel Henry Jul 16 '19 at 07:24
  • @LionelHenry I think I wasn't clear in my first comment responding to this suggestion of yours. The error that I had in my comment shows that your suggestion does not work . Please run your code and see for yourself. – M-- Jul 17 '19 at 04:50
  • Ah I see what's going on. The _current_ code in your answer gives the same error because you haven't defined `df`. I suggest running code with `reprex::reprex()` before posting on StackOverflow to make sure it runs correctly. You can also omit both `vars()` and `one_of()` as I mentioned. – Lionel Henry Jul 17 '19 at 07:13

Update with dplyr 1.0.0

The new across() accepts tidyselect helpers such as all_of(), which replaces the quote-unquote procedure of NSE (non-standard evaluation). The code looks a bit simpler with that:

aggregate <- function(df, by) {
  df %>% 
    group_by(across(all_of(by))) %>% 
    summarize(a = mean(a))
}

df %>% aggregate(group_key)
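
As a rough, self-contained sketch of how this behaves with one key, several keys, and a misspelled key (assuming dplyr >= 1.0.0): aggregate_by is a hypothetical name used here to avoid masking stats::aggregate(), and the values of a are fixed for illustration rather than drawn with sample().

```r
library(dplyr)  # assumes dplyr >= 1.0.0, where across() was introduced

df <- tibble(
  g1 = c(1, 1, 2, 2, 2),
  g2 = c(1, 2, 1, 2, 1),
  a = c(1, 4, 3, 5, 2)  # illustrative fixed values
)

# aggregate_by is a hypothetical name, not from the original answer
aggregate_by <- function(df, by) {
  df %>%
    group_by(across(all_of(by))) %>%
    summarize(a = mean(a), .groups = "drop")  # drop grouping in the result
}

# The same function accepts a single key or a character vector of keys
one_key  <- aggregate_by(df, "g1")
two_keys <- aggregate_by(df, c("g1", "g2"))

# all_of() is strict: a name that is not a column raises an error
# instead of being silently ignored
bad_key <- tryCatch(
  aggregate_by(df, "g3"),
  error = function(e) "error"
)
```

The strictness of all_of() is one reason to prefer it over one_of() in a wrapper like this, since a typo in by fails early rather than changing the grouping.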
Agile Bean