I want to write an R function using dplyr that summarises a data set and accepts a varying number of grouping variables for `group_by()`, including no grouping at all. I have found answers to similar questions that use `group_by_()`, but this has been deprecated (the dplyr version at the time of writing is 1.1.2).

I have tried different ways of passing vectors to `group_by()` using tidy evaluation, but none worked as expected, and all of them failed to return an answer when no grouping is required.

Here's the basis for a reproducible example using the starwars dataset. The function should return summary tables of the body mass index (BMI) of the various creatures.

```r
library(dplyr)

star_wars_BMI <- function(group_vec) {
  df_out <- starwars %>% 
    mutate(BMI = height/mass^2) %>% 
    group_by(group_vec) %>% 
    summarise(height_mean = mean(height, na.rm = TRUE),
              mass_mean = mean(mass, na.rm = TRUE),
              BMI_mean = mean(BMI, na.rm = TRUE))
  return(df_out)
}

group_vector0 <- c()  # i.e. summarise for the whole galaxy
group_vector1 <- c("homeworld")  # summarise by homeworld planet
group_vector2 <- c("homeworld", "species")  # summarise by species on each homeworld


galaxy_BMI <- star_wars_BMI(group_vec = group_vector0)
homeworld_BMI <- star_wars_BMI(group_vec = group_vector1)
```

I know it's a relatively simple task to produce separate functions for either no or some groups, but I would like to see if it is possible to do this with just one.

An explanation of the tidy evaluation rationale would be very much appreciated, as would an example that goes on to plot the summaries.

AdrianD

2 Answers

3

Here is another option that uses the ellipsis (`...`) as the argument carrying the column names for `group_by()`. Instead of a vector, we now pass the column names directly.

`rlang::ensyms(...)` captures the column names as symbols, and `!!!` then splices them into the `group_by()` call:

library(dplyr)

star_wars_BMI <- function(...) {
  
  group_vec <- rlang::ensyms(...)
  
  df_out <- starwars %>% 
    mutate (BMI = height/mass^2) %>% 
    group_by(!!!group_vec) %>% 
    summarise(height_mean = mean(height, na.rm = TRUE),
              mass_mean = mean(mass, na.rm = TRUE),
              BMI_mean = mean(BMI, na.rm = TRUE))
  
  return(df_out)
}


star_wars_BMI()
star_wars_BMI("homeworld")
star_wars_BMI("homeworld", "species")

Output:

> star_wars_BMI()
# A tibble: 1 × 3
  height_mean mass_mean BMI_mean
        <dbl>     <dbl>    <dbl>
1        174.      97.3   0.0481
> star_wars_BMI("homeworld")
# A tibble: 49 × 4
   homeworld      height_mean mass_mean BMI_mean
   <chr>                <dbl>     <dbl>    <dbl>
 1 Alderaan              176.      64     0.0463
 2 Aleen Minor            79       15     0.351 
 3 Bespin                175       79     0.0280
 4 Bestine IV            180      110     0.0149
 5 Cato Neimoidia        191       90     0.0236
 6 Cerea                 198       82     0.0294
 7 Champala              196      NaN   NaN     
 8 Chandrila             150      NaN   NaN     
 9 Concord Dawn          183       79     0.0293
10 Corellia              175       78.5   0.0284
# … with 39 more rows
# ℹ Use `print(n = ...)` to see more rows
> star_wars_BMI("homeworld", "species")
`summarise()` has grouped output by 'homeworld'. You can override using the
`.groups` argument.
# A tibble: 58 × 5
# Groups:   homeworld [49]
   homeworld      species   height_mean mass_mean BMI_mean
   <chr>          <chr>           <dbl>     <dbl>    <dbl>
 1 Alderaan       Human            176.      64     0.0463
 2 Aleen Minor    Aleena            79       15     0.351 
 3 Bespin         Human            175       79     0.0280
 4 Bestine IV     Human            180      110     0.0149
 5 Cato Neimoidia Neimodian        191       90     0.0236
 6 Cerea          Cerean           198       82     0.0294
 7 Champala       Chagrian         196      NaN   NaN     
 8 Chandrila      Human            150      NaN   NaN     
 9 Concord Dawn   Human            183       79     0.0293
10 Corellia       Human            175       78.5   0.0284
# … with 48 more rows
# ℹ Use `print(n = ...)` to see more rows
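
To see why the call with no arguments falls through to an ungrouped summary, it can help to look at what `ensyms()` actually captures. This is only a minimal sketch, assuming the rlang package is installed; `capture_groups()` is a hypothetical helper made up for illustration and is not part of the answer above.

```r
library(rlang)

# Hypothetical helper: returns whatever ensyms() captures from `...`
capture_groups <- function(...) {
  ensyms(...)
}

# No arguments: ensyms() returns an empty list
no_groups <- capture_groups()

# Strings and bare names are both converted to symbols
two_groups <- capture_groups("homeworld", species)

length(no_groups)
two_groups
```

Because `!!!` splices each element of that list in as a separate argument, an empty list makes `group_by(!!!group_vec)` behave like a plain `group_by()`, which leaves the data ungrouped.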
TarJae
  • You can simply do `group_by(...)` and then pass unquoted arguments, like `star_wars_BMI(homeworld, species)`. Note to the OP: I would recommend not returning a grouped data frame. Instead, use the `.groups = "drop"` argument of `summarise()`, unless you know you specifically need a grouped data frame. – LMc May 24 '23 at 15:22
  • Thank you both. You're right about dropping the groups. In another project I recently found some code running VERY slowly (3 hours when it used to take less than a minute), and it took me far too long and too much pulled-out hair to find out that someone upstream had grouped one of the regular files I was using. It didn't occur to me that grouping stays intact in saved RDS files. – AdrianD May 25 '23 at 08:45
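
The two suggestions in the comments above can be combined into one function. The following is only a sketch of that idea, not a tested replacement for the answer: `star_wars_BMI_dots` is a made-up name, and the BMI formula is kept exactly as written in the question so the numbers stay comparable.

```r
library(dplyr)

# Hypothetical variant: forwards `...` straight to group_by(),
# so callers pass bare (unquoted) column names
star_wars_BMI_dots <- function(...) {
  starwars %>%
    mutate(BMI = height / mass^2) %>%   # formula as written in the question
    group_by(...) %>%
    summarise(
      height_mean = mean(height, na.rm = TRUE),
      mass_mean   = mean(mass, na.rm = TRUE),
      BMI_mean    = mean(BMI, na.rm = TRUE),
      .groups     = "drop"              # always return an ungrouped tibble
    )
}

galaxy_BMI    <- star_wars_BMI_dots()                    # no grouping
homeworld_BMI <- star_wars_BMI_dots(homeworld)           # one grouping column
species_BMI   <- star_wars_BMI_dots(homeworld, species)  # two grouping columns
```

With `.groups = "drop"` the message about grouped output no longer appears, and the result carries no leftover grouping into later steps.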
2

Hope you are doing well.

I believe you can use `across()` inside `group_by()`, like this:

star_wars_BMI <- function(group_vec) {
  df_out <- starwars %>% 
    mutate (BMI = height/mass^2) %>% 
    group_by(across(group_vec)) %>% 
    summarise(height_mean = mean(height, na.rm = T),
              mass_mean = mean(mass, na.rm = T),
              BMI_mean = mean(BMI, na.rm = T))
  return(df_out)
}

group_vector0 <- c()  # ie. summarise for the whole galaxy
group_vector1 <- c("homeworld")  # summarise by homeworld planet
group_vector2 <- c("homeworld", "species") # summarise by species on each homeworld


star_wars_BMI(group_vec = group_vector0)
star_wars_BMI(group_vec = group_vector1)
star_wars_BMI(group_vec = group_vector2)
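
On the tidy evaluation rationale asked about in the question: `group_by()` uses data masking, so a bare `group_vec` is treated as an expression to evaluate rather than as a set of column names. `across()` switches to tidy selection, where a character vector of names is understood. Recent tidyselect versions deprecate passing a bare external vector to a selection, so the sketch below wraps it in `all_of()`. It is an illustration under assumptions, not the answer's exact code: `star_wars_BMI_chr` is a made-up name, the empty case is written as `character()` instead of `c()`, and the BMI formula is kept as in the question.

```r
library(dplyr)

# Data masking: `group_vec` is evaluated, so this creates and groups by
# a constant column that is literally named "group_vec"
group_vec <- "homeworld"
masked <- starwars %>% group_by(group_vec)
group_vars(masked)

# Hypothetical variant: takes a character vector of column names
star_wars_BMI_chr <- function(group_vec = character()) {
  starwars %>%
    mutate(BMI = height / mass^2) %>%          # formula as written in the question
    group_by(across(all_of(group_vec))) %>%    # tidy selection by name
    summarise(
      height_mean = mean(height, na.rm = TRUE),
      mass_mean   = mean(mass, na.rm = TRUE),
      BMI_mean    = mean(BMI, na.rm = TRUE),
      .groups     = "drop"
    )
}

galaxy_BMI    <- star_wars_BMI_chr()                           # empty selection, no grouping
homeworld_BMI <- star_wars_BMI_chr("homeworld")
species_BMI   <- star_wars_BMI_chr(c("homeworld", "species"))
```

`all_of()` also raises an error if a requested column does not exist, which tends to surface typos in the grouping vector early.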
Mohd_PH