R function to summarise using dplyr group_by with flexible groups, including no grouping at all

Question

I want to write an R function using dplyr to summarise a data set. The function should accept different numbers of grouping variables for the group_by statement, including no grouping at all. I have found answers to similar questions that use 'group_by_', but this has been deprecated (dplyr version at time of writing is 1.1.2).

I have tried different methods of passing vectors to the group_by statement using tidy evaluation, but none have worked as expected, and all failed to return a result when no grouping is required.

Here's the basis for a reproducible example using the starwars dataset. The function should be capable of returning summary tables of the Body-Mass Indexes (BMI) of the various creatures.

`star_wars_BMI <- function(group_vec) {
  df_out <- starwars %>% 
    mutate (BMI = height/mass^2) %>% 
    group_by(group_vec) %>% 
    summarise(height_mean = mean(height, na.rm = T),
              mass_mean = mean(mass, na.rm = T),
              BMI_mean = mean(BMI, na.rm = T))
  return(df_out)
}

group_vector0 <- c()  # ie. summarise for the whole galaxy
group_vector1 <- c("homeworld")  # summarise by homeworld planet
group_vector2 <- c("homeworld", "species")  # summarise by species on each homeworld


galaxy_BMI <- star_wars_BMI(group_vec = group_vector0)
homeworld_BMI <- star_wars_BMI(group_vec = group_vector1)
`

I know it's a relatively simple task to produce separate functions for either no or some groups, but I would like to see if it is possible to do this with just one.

An explanation of the tidy evaluation rationale would be very much appreciated, as would an example that went on to plot the summaries.

TarJae · Answer 1 · 2023-05-24T15:16:54.500

Here is another option that uses the ellipsis (`...`) to take the grouping columns. Instead of a vector, we now pass the column names as separate arguments.

`rlang::ensyms(...)` captures the column names as symbols, and `!!!` then unquotes and splices them into the `group_by()` call:

library(dplyr)

star_wars_BMI <- function(...) {
  
  group_vec <- rlang::ensyms(...)
  
  df_out <- starwars %>% 
    mutate (BMI = height/mass^2) %>% 
    group_by(!!!group_vec) %>% 
    summarise(height_mean = mean(height, na.rm = TRUE),
              mass_mean = mean(mass, na.rm = TRUE),
              BMI_mean = mean(BMI, na.rm = TRUE))
  
  return(df_out)
}


star_wars_BMI()
star_wars_BMI("homeworld")
star_wars_BMI("homeworld", "species")

Output:

> star_wars_BMI()
# A tibble: 1 × 3
  height_mean mass_mean BMI_mean
        <dbl>     <dbl>    <dbl>
1        174.      97.3   0.0481
> star_wars_BMI("homeworld")
# A tibble: 49 × 4
   homeworld      height_mean mass_mean BMI_mean
   <chr>                <dbl>     <dbl>    <dbl>
 1 Alderaan              176.      64     0.0463
 2 Aleen Minor            79       15     0.351 
 3 Bespin                175       79     0.0280
 4 Bestine IV            180      110     0.0149
 5 Cato Neimoidia        191       90     0.0236
 6 Cerea                 198       82     0.0294
 7 Champala              196      NaN   NaN     
 8 Chandrila             150      NaN   NaN     
 9 Concord Dawn          183       79     0.0293
10 Corellia              175       78.5   0.0284
# … with 39 more rows
# ℹ Use `print(n = ...)` to see more rows
> star_wars_BMI("homeworld", "species")
`summarise()` has grouped output by 'homeworld'. You can override using the
`.groups` argument.
# A tibble: 58 × 5
# Groups:   homeworld [49]
   homeworld      species   height_mean mass_mean BMI_mean
   <chr>          <chr>           <dbl>     <dbl>    <dbl>
 1 Alderaan       Human            176.      64     0.0463
 2 Aleen Minor    Aleena            79       15     0.351 
 3 Bespin         Human            175       79     0.0280
 4 Bestine IV     Human            180      110     0.0149
 5 Cato Neimoidia Neimodian        191       90     0.0236
 6 Cerea          Cerean           198       82     0.0294
 7 Champala       Chagrian         196      NaN   NaN     
 8 Chandrila      Human            150      NaN   NaN     
 9 Concord Dawn   Human            183       79     0.0293
10 Corellia       Human            175       78.5   0.0284
# … with 48 more rows
# ℹ Use `print(n = ...)` to see more rows

You can simply do `group_by(...)` and then you can pass unquoted arguments like `star_wars_BMI(homeworld, species)`. Note to the OP: I would recommend not returning a grouped data frame. Instead use the `.groups = "drop"` argument of `summarise`, unless you know you specifically need a grouped data frame. – LMc, May 24 '23 at 15:22
Thank you both. You're right about dropping the groups. In another project I recently found some code running VERY slowly (3 hours when it used to take less than a minute), and it took me far too long, and too much hair-pulling, to find out that someone upstream had grouped one of the regular files I was using for this. It didn't occur to me that grouping stayed intact in saved RDS files. – AdrianD, May 25 '23 at 08:45
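
LMc's suggestion above can be sketched roughly as follows. This is only an illustration, not code from either answer: the function name `star_wars_BMI_dots` is made up, and the BMI formula is kept exactly as written in the question. The dots are forwarded straight to `group_by()`, so callers pass bare column names, and `.groups = "drop"` makes `summarise()` return an ungrouped tibble.

```r
library(dplyr)

# Hypothetical name, for illustration only.
# Forwards ... to group_by(), so grouping columns are given as bare names.
star_wars_BMI_dots <- function(...) {
  starwars %>%
    mutate(BMI = height / mass^2) %>%   # formula kept as in the question
    group_by(...) %>%                   # no arguments means no grouping
    summarise(
      height_mean = mean(height, na.rm = TRUE),
      mass_mean   = mean(mass, na.rm = TRUE),
      BMI_mean    = mean(BMI, na.rm = TRUE),
      .groups = "drop"                  # return an ungrouped tibble
    )
}

galaxy   <- star_wars_BMI_dots()                    # whole data set, one row
by_world <- star_wars_BMI_dots(homeworld)           # one row per homeworld
by_both  <- star_wars_BMI_dots(homeworld, species)  # one row per homeworld/species pair
```

Because `group_by()` already understands bare column names, passing the dots through needs no explicit quoting or unquoting inside the function.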

score 2 · Answer 2 · answered May 24 '23 at 14:08

Hope you are doing well.

I believe you can use `across()`.

For example:

star_wars_BMI <- function(group_vec) {
  df_out <- starwars %>% 
    mutate (BMI = height/mass^2) %>% 
    group_by(across(group_vec)) %>% 
    summarise(height_mean = mean(height, na.rm = T),
              mass_mean = mean(mass, na.rm = T),
              BMI_mean = mean(BMI, na.rm = T))
  return(df_out)
}

group_vector0 <- c()  # ie. summarise for the whole galaxy
group_vector1 <- c("homeworld")  # summarise by homeworld planet
group_vector2 <- c("homeworld", "species") # summarise by species on each homeworld


star_wars_BMI(group_vec = group_vector0)
star_wars_BMI(group_vec = group_vector1)
star_wars_BMI(group_vec = group_vector2)

Mohd_PH

Normally dplyr warns against using external variables without the selection helpers. In this case it doesn't, why? Shouldn't we use `across(all_of(group_vec))`? – Ricardo Semião e Castro May 24 '23 at 14:34
@RicardoSemiãoeCastro Honestly I'm not sure why :D – Mohd_PH May 24 '23 at 14:41
Thank you - this worked. I did receive the dplyr warning to use 'all_of' in my session though (not a problem). – AdrianD May 25 '23 at 08:51
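
The `all_of()` variant raised in these comments might look like the sketch below. On the tidy evaluation side, `group_by()` uses data masking, so a character vector passed directly is treated as a value to group on rather than as a set of column names. Wrapping it in `across(all_of(...))` switches to tidy selection, where the strings are matched against column names. The function name `star_wars_BMI_chr` is hypothetical, the BMI formula is kept as in the question, and the plotting step assumes ggplot2 is installed.

```r
library(dplyr)

# Hypothetical name, for illustration only.
# Takes a character vector of grouping columns, which may be empty.
star_wars_BMI_chr <- function(group_vec = character()) {
  # c() evaluates to NULL, so coerce to a zero-length character vector
  group_vec <- as.character(group_vec)

  starwars %>%
    mutate(BMI = height / mass^2) %>%        # formula kept as in the question
    group_by(across(all_of(group_vec))) %>%  # all_of() treats the strings as column names
    summarise(
      height_mean = mean(height, na.rm = TRUE),
      mass_mean   = mean(mass, na.rm = TRUE),
      BMI_mean    = mean(BMI, na.rm = TRUE),
      .groups = "drop"                       # always return an ungrouped tibble
    )
}

galaxy_BMI    <- star_wars_BMI_chr(c())
homeworld_BMI <- star_wars_BMI_chr("homeworld")
both_BMI      <- star_wars_BMI_chr(c("homeworld", "species"))

# Optional bar chart of the per-homeworld summary (assumes ggplot2 is installed).
if (requireNamespace("ggplot2", quietly = TRUE)) {
  plot_data <- filter(homeworld_BMI, !is.na(BMI_mean))  # drop homeworlds with no BMI
  bmi_plot <- ggplot2::ggplot(plot_data, ggplot2::aes(x = BMI_mean, y = homeworld)) +
    ggplot2::geom_col()
  # print(bmi_plot) draws the chart
}
```

An unknown column name in `group_vec` makes `all_of()` raise an error instead of being silently ignored, which is usually what you want in a helper like this.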
