I want to group a data frame using different sets of grouping variables. For each group I want to count the number of observations (or summarize in any other way) and then collect all results in one data frame.

Important: I want to define the sets of grouping variables programmatically, for example as a list.

How do I achieve this in the tidyverse?

Here is my attempt:

library(tidyverse)

count_by_group <- function(...) {
  mtcars %>%
    count(...) %>%
    mutate(
      grouping_variable = paste(ensyms(...), collapse = "."),
      group = paste(!!!enquos(...), sep = ".")
    ) %>%
    select(grouping_variable, group, n)
}

# I want this ...
bind_rows(
  count_by_group(cyl),
  count_by_group(gear),
  count_by_group(cyl, gear)
)
#>    grouping_variable group  n
#> 1                cyl     4 11
#> 2                cyl     6  7
#> 3                cyl     8 14
#> 4               gear     3 15
#> 5               gear     4 12
#> 6               gear     5  5
#> 7           cyl.gear   4.3  1
#> 8           cyl.gear   4.4  8
#> 9           cyl.gear   4.5  2
#> 10          cyl.gear   6.3  2
#> 11          cyl.gear   6.4  4
#> 12          cyl.gear   6.5  1
#> 13          cyl.gear   8.3 12
#> 14          cyl.gear   8.5  2

# ... but without the repetition of "count_by_group(var)".
# The following does not work:
map_dfr(
  list(
    cyl,
    gear,
    c(cyl, gear)
  ),
  count_by_group
)
#> Error in map(.x, .f, ...): object 'cyl' not found

Created on 2020-09-17 by the reprex package (v0.3.0)

robust
  • Perhaps you need `rollup` aggregation i.e. `rollup(as.data.table(mtcars), j = .N, by = c("cyl","gear"))` – akrun Sep 17 '20 at 18:45
  • And what's wrong with what you've done? – Allan Cameron Sep 17 '20 at 18:45
  • @AllanCameron I guess the OP's code is not working with `map` at the end of the post – akrun Sep 17 '20 at 18:48
  • If you use `map_dfr(rlang::exprs(cyl, gear), count_by_group)` it should work, but the last expression with `c` does not give the intended behavior you showed in the working case – akrun Sep 17 '20 at 18:54
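
The error in the question's last call comes from `list()` evaluating its arguments immediately, so `cyl` is looked up as an ordinary object before any tidyverse function sees it. Below is a minimal sketch of the defusing idea from the last comment. It assumes single-variable groupings only, and it injects each symbol explicitly with `!!`; that injection step is an addition here, not part of the comment.

```r
library(dplyr)
library(purrr)
library(rlang)

# exprs() defuses the column names instead of evaluating them,
# giving a list of symbols
single_vars <- exprs(cyl, gear)

# `!!` injects each symbol back into count() as a column reference
counts <- map(single_vars, ~ count(mtcars, !!.x))

# First element: counts per value of cyl
counts[[1]]
```

A set with several variables cannot be written as `c(cyl, gear)` here, because `count()` would treat that as an expression computing a single new column. A list of `vars()` sets spliced with `!!!` covers that case.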

1 Answer

Update (2020-10-12): More transparent solution (thanks to @LionelHenry)

library(tidyverse)

count_by_group <- function(...) {
  dots <- enquos(..., .named = TRUE)
  names <- names(dots)

  counted <- count(mtcars, !!!dots)

  group <- counted %>%
    select(-n) %>%
    rowwise() %>%
    mutate(paste(c_across(everything()), collapse = ".")) %>%
    pull()

  # # Equivalently:
  # group <- counted %>%
  #   select(-n) %>%
  #   pmap_chr(paste, sep = ".")

  counted %>%
    mutate(
      grouping_variable = paste(names, collapse = "."),
      group = group
    ) %>%
    select(grouping_variable, group, n)
}

grouping_variables <- list(
  vars(cyl),
  vars(gear),
  vars(cyl, gear)
)

map_dfr(grouping_variables, ~ count_by_group(!!! .x))
#>    grouping_variable group  n
#> 1                cyl     4 11
#> 2                cyl     6  7
#> 3                cyl     8 14
#> 4               gear     3 15
#> 5               gear     4 12
#> 6               gear     5  5
#> 7           cyl.gear   4.3  1
#> 8           cyl.gear   4.4  8
#> 9           cyl.gear   4.5  2
#> 10          cyl.gear   6.3  2
#> 11          cyl.gear   6.4  4
#> 12          cyl.gear   6.5  1
#> 13          cyl.gear   8.3 12
#> 14          cyl.gear   8.5  2

Created on 2020-10-12 by the reprex package (v0.3.0)
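
If the grouping sets are easier to store as plain character vectors, the same result can probably be reached without `vars()` or `!!!`. The following is only a sketch under that assumption: `count_by_names()` and `grouping_sets` are hypothetical names that do not appear in the post, and `count(across(all_of(...)))` needs dplyr 1.0.0 or later.

```r
library(dplyr)
library(purrr)

# Hypothetical helper (not from the post): takes column names as strings
count_by_names <- function(data, vars) {
  # all_of() selects exactly the columns named in the character vector
  counted <- count(data, across(all_of(vars)))

  # Paste the grouping columns together row by row
  group <- do.call(paste, c(unname(as.list(counted[vars])), sep = "."))

  tibble(
    grouping_variable = paste(vars, collapse = "."),
    group = group,
    n = counted$n
  )
}

# Grouping sets defined programmatically as a list of character vectors
grouping_sets <- list("cyl", "gear", c("cyl", "gear"))

result <- map_dfr(grouping_sets, count_by_names, data = mtcars)
```

Passing strings sidesteps defusing altogether, at the cost of supporting only existing columns rather than computed groupings.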


Original answer (since updated to use `vars()`): I just found that this works!

library(tidyverse)

count_by_group <- function(...) {
  mtcars %>%
    count(...) %>%
    mutate(
      grouping_variable = paste(ensyms(...), collapse = "."),
      group = paste(!!!enquos(...), sep = ".")
    ) %>%
    select(grouping_variable, group, n)
}

grouping_variables <- list(
  vars(cyl),
  vars(gear),
  vars(cyl, gear)
)

map_dfr(grouping_variables, ~count_by_group(!!! .))
#>    grouping_variable group  n
#> 1                cyl     4 11
#> 2                cyl     6  7
#> 3                cyl     8 14
#> 4               gear     3 15
#> 5               gear     4 12
#> 6               gear     5  5
#> 7           cyl.gear   4.3  1
#> 8           cyl.gear   4.4  8
#> 9           cyl.gear   4.5  2
#> 10          cyl.gear   6.3  2
#> 11          cyl.gear   6.4  4
#> 12          cyl.gear   6.5  1
#> 13          cyl.gear   8.3 12
#> 14          cyl.gear   8.5  2

Created on 2020-10-12 by the reprex package (v0.3.0)

robust
  • `modify_depth(grouping_variables, 2, sym)` is an alternative to `map(grouping_variables, ~map(., sym))` – robust Sep 17 '20 at 19:20
  • I would do `list(vars(foo, bar), vars(baz))`. The `vars()` operator is meant for this use case of capturing multiple expressions externally. – Lionel Henry Sep 18 '20 at 06:00
  • `paste()`ing several variables together to get a group identifier doesn't seem very robust but will work in simple cases (integers, characters, ...). – Lionel Henry Sep 18 '20 at 06:03
  • Why do you use `ensyms(...)` at one point and `enquos(...)` at another? It seems better to be consistent. In this case I would settle on `ensyms()`, which is a little limiting but for good reasons: You are passing unevaluated `...` multiple times. This means they will be evaluated multiple times. No big deal when they are just column names, but any computations (like `vars(cyl + rnorm(1))`) would be performed multiple times, which is not correct programming. – Lionel Henry Sep 18 '20 at 06:06
  • Thank you for your comments @LionelHenry. I have tried to implement a variant using only `vars` and `ensyms`, but I could not get it to work. Could you explain a bit more what the code would look like using only those commands? – robust Sep 20 '20 at 21:48
  • Would you have any ideas for an alternative solution to identify a group? My goal of the resulting data frame is to create a single plot that shows the data from all different groupings. Pasting the variables into a group identifier allows me to easily achieve this using `df %>% ggplot(aes(x = group, y = n)) + geom_col()`. – robust Sep 20 '20 at 22:00
  • ... or rather `df %>% ggplot(aes(x = interaction(grouping_variable, group), y = n)) + geom_col() + scale_x_discrete(labels = df$group)`, which keeps duplicate group values separated. – robust Sep 21 '20 at 04:26
  • If you defuse arguments with an external `vars()` then you wouldn't use internal defusing with `ensyms()`. Regarding group identification, one way is with `vctrs::vec_group_id()` which is used in dplyr internals. Note that this API is still maturing and we'll probably change the name at some point, but with a deprecation period. – Lionel Henry Sep 22 '20 at 09:01
  • Thank you for the pointer to `vctrs::vec_group_id()`. – robust Sep 24 '20 at 00:46
  • I still could not get the version with `vars()` to work. I have removed `ensyms()`, but the defused arguments still seem to need *some* kind of internal (de-)fusing. Would you be able to give more direction? ```count_by_group <- function(...) { browser() mtcars %>% count(...) %>% mutate( grouping_variable = paste(..., collapse = "."), group = paste(..., sep = ".") ) select(grouping_variable, group, n) } grouping_variables <- list( vars(cyl), vars(gear), vars(cyl, gear) ) map_dfr(grouping_variables, count_by_group)``` – robust Sep 24 '20 at 00:46
  • If your function takes dots and you want to pass a list of arguments created with `vars()`, you need to transform the list to individual arguments with `!!!`. Try something like `map_dfr(grouping_variables, ~ count_by_group(!!!.x))` – Lionel Henry Sep 29 '20 at 11:36
  • Thank you for your continued help. I have incorporated your comment, but now `group = paste(..., sep = ".")` raises an error. Would you have any more comments so that I can get your version to work? Here is where I am at: `count_by_group <- function(...) { mtcars %>% count(...) %>% mutate( grouping_variable = paste(ensyms(...), collapse = "."), group = paste(..., sep = ".") ) %>% select(grouping_variable, group, n) } grouping_variables <- list( vars(cyl), vars(gear), vars(cyl, gear) ) map_dfr(grouping_variables, ~ count_by_group(!!! .))` – robust Oct 03 '20 at 01:02
  • Can we find another medium for this discussion? Maybe in the comments of a gist? My github name is `@lionel-`. – Lionel Henry Oct 03 '20 at 10:38
  • @LionelHenry sure, I tagged you in a gist! – robust Oct 09 '20 at 04:15
  • Thank you. I updated my original solution using `vars` and added your more transparent solution. – robust Oct 13 '20 at 04:25
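
To illustrate the robustness concern raised in the comments: pasted identifiers can collide when the values themselves contain the separator. Below is a small sketch of the suggested `vctrs::vec_group_id()` alternative, which assigns one integer id per distinct combination of values. Per the comment above, this API was still maturing and may be renamed, so treat the call as an assumption about the installed vctrs version. The data frame `a` is made up for illustration.

```r
library(vctrs)

# Two different value pairs that paste() cannot tell apart
a <- data.frame(x = c("1.2", "1"), y = c("3", "2.3"))
pasted <- paste(a$x, a$y, sep = ".")
pasted[1] == pasted[2]
#> [1] TRUE

# vec_group_id() compares the rows themselves,
# so the two rows receive different ids
ids_a <- vec_group_id(a)

# One integer id per distinct (cyl, gear) combination in mtcars
ids <- vec_group_id(mtcars[c("cyl", "gear")])
length(unique(ids))
#> [1] 8
```

The ids are not human-readable, so for axis labels in a plot a pasted label may still be convenient alongside them.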