3

I have a large dataset from which I wish to obtain summary estimates (mean, medians, counts, etc) of one column when grouped by two other columns.

Trying really hard to work out how to do this using purrr - hopefully to get this workflow to click for future projects... but very stuck.

As a reproducible example, this works for grouping by am and vs, and estimating summary values of mpg

library(tidyverse)
library(rlang)

mtcars %>%
  group_by(am, vs) %>%
  summarise(mean_mpg = mean(mpg),
            median_mpg = median(mpg),
            count = n())

However, to extend this example, say I wanted to group for am and vs; then am and gear; then am and carb. Intuitively, this seems to be something map should handle.

group_vars <- c("vs", "gear", "carb")
group_syms <- rlang::syms(group_vars)
sym_am <- rlang::sym("am")

mtcars %>%
  map_df(~group_by(!!sym_am, !!!group_syms) %>%
           summarise(mean_mpg = mean(mpg),
           summarise(median_mpg = median(mpg),
           summarise(count = n())
  )

#Error in !sym_am : invalid argument type
Peter MacPherson
  • 683
  • 1
  • 7
  • 17

2 Answers2

1

Here's one approach

library(tidyverse)

variable_grp <- c("vs", "gear", "carb")
constant_grp <- c("am")
group_vars <- lapply(variable_grp, function(i) c(constant_grp, i))

map(group_vars, ~group_by_at(mtcars, .x) %>% 
                summarise(  mean_mgp = mean(mpg),
                        median_mpg = median(mpg),
                        count = n()))

This will produce a list of the summary statistics for each group. The issue with using map_df with your problem is that your column names for each group are different (1st group: am, vs ; 2nd group: am, gear ...). Therefore, you need to rename the variable_column if you're using map_df

map_df(group_vars, ~group_by_at(mtcars, .x) %>% 
                summarise(  mean_mgp = mean(mpg),
                        median_mpg = median(mpg),
                        count = n()) %>%
                setNames(c("am", "variable_column", "mean_mpg", "median_mpg", "count")))

# A tibble: 17 x 5
# Groups:   am [2]
      # am variable_column mean_mpg median_mpg count
   # <dbl>           <dbl>    <dbl>      <dbl> <int>
 # 1     0               0 15.05000      15.20    12
 # 2     0               1 20.74286      21.40     7
 # 3     1               0 19.75000      20.35     6
 # 4     1               1 28.37143      30.40     7
 # 5     0               3 16.10667      15.50    15
 # 6     0               4 21.05000      21.00     4
 # 7     1               4 26.27500      25.05     8
 # 8     1               5 21.38000      19.70     5
 # 9     0               1 20.33333      21.40     3
# 10     0               2 19.30000      18.95     6
# 11     0               3 16.30000      16.40     3
# 12     0               4 14.30000      14.30     7
# 13     1               1 29.10000      29.85     4
# 14     1               2 27.05000      28.20     4
# 15     1               4 19.26667      21.00     3
# 16     1               6 19.70000      19.70     1
# 17     1               8 15.00000      15.00     1

You can save the variable_column name using the .id argument of map_df and a post-map_df mutate

map_df(group_vars, ~group_by_at(mtcars, .x) %>% 
                summarise(  mean_mgp = mean(mpg),
                        median_mpg = median(mpg),
                        count = n()) %>%
                setNames(c("am", "variable_column", "mean_mpg", "median_mpg", "count")),
            .id="variable_col_name") %>%
            mutate(variable_col_name = variable_grp[as.numeric(variable_col_name)])

# A tibble: 17 x 6
# Groups:   am [2]
   # variable_col_name    am variable_column mean_mpg median_mpg count
               # <chr> <dbl>           <dbl>    <dbl>      <dbl> <int>
 # 1                vs     0               0 15.05000      15.20    12
 # 2                vs     0               1 20.74286      21.40     7
 # 3                vs     1               0 19.75000      20.35     6
 # 4                vs     1               1 28.37143      30.40     7
 # 5              gear     0               3 16.10667      15.50    15
 # 6              gear     0               4 21.05000      21.00     4
 # 7              gear     1               4 26.27500      25.05     8
 # 8              gear     1               5 21.38000      19.70     5
 # 9              carb     0               1 20.33333      21.40     3
# 10              carb     0               2 19.30000      18.95     6
# 11              carb     0               3 16.30000      16.40     3
# 12              carb     0               4 14.30000      14.30     7
# 13              carb     1               1 29.10000      29.85     4
# 14              carb     1               2 27.05000      28.20     4
# 15              carb     1               4 19.26667      21.00     3
# 16              carb     1               6 19.70000      19.70     1
# 17              carb     1               8 15.00000      15.00     1
CPak
  • 13,260
  • 3
  • 30
  • 48
1

We could use the map2 from purrr to use multiple symbols as arguments and then evaluate it within the group_by and summarise the output

library(tidyverse)
map2_df(list(sym_am), group_syms, ~ mtcars %>%
         group_by(!!.x, !!.y) %>% 
         summarise(mean_mgp = mean(mpg), median_mpg = median(mpg),count = n()))
akrun
  • 874,273
  • 37
  • 540
  • 662
  • 1
    Thanks - very elegant! Can I ask why the `list(sym_am)` is required? I get "Error: `.x` is not a vector (symbol)" when I omit, but I thought I had already called as a symbol? – Peter MacPherson Dec 18 '17 at 17:46
  • 1
    @PeterMacPherson Using either `c(sym_am)` or `list(sym_am)` place it in a `list` and then it becomes recycled especially for `map`. The `base R` Map should work without it – akrun Dec 18 '17 at 17:50