4

I'm currently repeating a lot of code, because I always need to summarize the same columns for different groups. How can I do this effectively by writing the summarize call (which is always the same) only once, while defining the output name and the group_by arguments case by case?

A minimum example:

library(dplyr)

col1 <- c("UK", "US", "UK", "US")
col2 <- c("Tech", "Social", "Social", "Tech")
col3 <- c("0-5years", "6-10years", "0-5years", "0-5years")
col4 <- 1:4
col5 <- 5:8

df <- data.frame(col1, col2, col3, col4, col5)

result1 <- df %>% 
  group_by(col1, col2) %>% 
  summarize(sum1 = sum(col4, col5))

result2 <- df %>% 
  group_by(col2, col3) %>% 
  summarize(sum1 = sum(col4, col5))

result3 <- df %>% 
  group_by(col1, col3) %>% 
  summarize(sum1 = sum(col4, col5))
huan
  • The `ddply` function from plyr is more succinct than `group_by %>% summarise`. You can rewrite the first one as `ddply(df, .(col1, col2), summarise, sum1 = sum(col4, col5))`. It doesn't answer your actual question, but it will cut down the number of lines you use. – morgan121 Apr 29 '19 at 12:08

4 Answers

5

Using combn:

combn(colnames(df)[1:3], 2, FUN = function(x){
  df %>% 
    group_by(.dots = x) %>% 
    summarize(sum1 = sum(col4, col5))
  }, simplify = FALSE)
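The `.dots` argument of `group_by()` is deprecated in recent dplyr releases. Here is a minimal sketch of the same idea using `across(all_of(x))`, assuming dplyr 1.0 or later. The list name `results` and the `col1_col2` naming scheme are my own choices for illustration, not part of the answer above:

```r
library(dplyr)

df <- data.frame(
  col1 = c("UK", "US", "UK", "US"),
  col2 = c("Tech", "Social", "Social", "Tech"),
  col3 = c("0-5years", "6-10years", "0-5years", "0-5years"),
  col4 = 1:4,
  col5 = 5:8
)

# All pairs of grouping columns, one character vector per pair
group_sets <- combn(colnames(df)[1:3], 2, simplify = FALSE)

# Name each pair after its columns, e.g. "col1_col2" (naming scheme is an assumption)
names(group_sets) <- vapply(group_sets, paste, character(1), collapse = "_")

# One summary per pair; lapply() carries the names over to the results
results <- lapply(group_sets, function(x) {
  df %>%
    group_by(across(all_of(x))) %>%
    summarize(sum1 = sum(col4, col5), .groups = "drop")
})

results$col1_col2
```

Keeping the summaries in one named list means each one can be looked up by its grouping columns instead of being stored in a separate `result1`, `result2`, ... variable.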
zx8754
2

To use dplyr inside your own functions, you can use tidy evaluation. This is necessary because dplyr uses non-standard evaluation: arguments such as column names are captured as expressions and evaluated in the context of the data frame rather than as ordinary R code. I recommend reading this:

https://tidyeval.tidyverse.org/modifying-inputs.html#modifying-quoted-expressions

summarizefunction <- function(data, ..., sumvar1, sumvar2) {

    groups <- enquos(...)
    sumvar1 <- enquo(sumvar1)
    sumvar2 <- enquo(sumvar2)

    result <- data %>%
        group_by(!!!groups) %>%
        summarise(sum1 = sum(!!sumvar1, !!sumvar2))
    return(result)
}

summarizefunction(df, col1, col2, sumvar1 = col4, sumvar2 = col5)

`enquo()` captures (quotes) an argument so that it is not evaluated immediately, and `enquos()` does the same for everything passed through `...`. You can then use the `!!` operator (called "bang bang") to unquote a single captured argument, and `!!!` to splice a list of them. I think this is the most flexible and reusable solution, even though you have to write a bit more code up front.
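With rlang 0.4.0 and later, the quote-then-unquote pair can usually be written in one step with the embrace operator `{{ }}`. The following is a sketch under that assumption (it also needs dplyr 1.0 or later for `across()`); `summarize_by` and its argument names are hypothetical:

```r
library(dplyr)

df <- data.frame(
  col1 = c("UK", "US", "UK", "US"),
  col2 = c("Tech", "Social", "Social", "Tech"),
  col3 = c("0-5years", "6-10years", "0-5years", "0-5years"),
  col4 = 1:4,
  col5 = 5:8
)

# Hypothetical helper: `groups` takes a selection such as c(col1, col2);
# `x` and `y` are the two columns whose values are added up.
summarize_by <- function(data, groups, x, y) {
  data %>%
    group_by(across({{ groups }})) %>%
    summarize(sum1 = sum({{ x }}, {{ y }}), .groups = "drop")
}

result1 <- summarize_by(df, c(col1, col2), col4, col5)
result2 <- summarize_by(df, c(col2, col3), col4, col5)
```

Passing the grouping columns as a single `c(...)` selection leaves `...` free, at the cost of a slightly different call style from the `enquos(...)` version.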

DSGym
  • This approach seems to be the most fitting one for me. Only one question: I have hundreds of different variables to sum, divide, etc. Is there a way not to type/copy all of them in the `function()` part? – huan Apr 29 '19 at 13:10
  • The number of group combinations is much smaller (8). – huan Apr 29 '19 at 13:19
  • I would recommend you have a look at the reshape2 package. With it you can restructure your dataset into a tidy long format, for example with `reshape2::melt(df)`. Then consider the `split` function, which converts your long df into a list of smaller data frames, and use `lapply` in combination with `summarizefunction`. Aggregating over multiple columns is almost always a "not so nice" idea. If you like my solution, please accept my answer :-) – DSGym Apr 29 '19 at 14:08
1

First, wrap the pipeline in a function that takes the grouping column names as strings and converts them to symbols:

library(tidyverse)
res_func <- function(x, y){
  df %>% 
  group_by(!!as.symbol(x), !!as.symbol(y)) %>% 
  summarize(sum1 = sum(col4, col5))
}

It works a charm:

res_func("col1", "col2")

# A tibble: 4 x 3
# Groups:   col1 [2]
  col1  col2    sum1
  <fct> <fct>  <int>
1 UK    Social    10
2 UK    Tech       6
3 US    Social     8
4 US    Tech      12
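As an alternative to `!!as.symbol(x)`, string column names can also be looked up through the `.data` pronoun. This is a minimal sketch assuming a recent dplyr; `res_func3` and its `data` argument are hypothetical names, and passing the data frame in explicitly is my own change so the function does not depend on a global `df`:

```r
library(dplyr)

df <- data.frame(
  col1 = c("UK", "US", "UK", "US"),
  col2 = c("Tech", "Social", "Social", "Tech"),
  col3 = c("0-5years", "6-10years", "0-5years", "0-5years"),
  col4 = 1:4,
  col5 = 5:8
)

# Hypothetical variant: x and y are column names given as strings
res_func3 <- function(data, x, y) {
  data %>%
    group_by(.data[[x]], .data[[y]]) %>%
    summarize(sum1 = sum(col4, col5), .groups = "drop")
}

res <- res_func3(df, "col1", "col2")
```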

We can use assign inside the function to store the result under a name built from the parameters you passed in:

res_func2 <- function(x, y){
  assign(paste0("result_", x, y),
         df %>% 
           group_by(!!as.symbol(x), !!as.symbol(y)) %>% 
           summarize(sum1 = sum(col4, col5)), 
         envir = parent.frame())
}

Running res_func2("col1", "col2") then creates a new data frame called result_col1col2 in the calling environment:

> result_col1col2
# A tibble: 4 x 3
# Groups:   col1 [2]
  col1  col2    sum1
  <fct> <fct>  <int>
1 UK    Social    10
2 UK    Tech       6
3 US    Social     8
4 US    Tech      12
nycrefugee
1

You can also use purrr::partial in these situations:

library(purrr)
summarize45 <- partial(summarize, sum1 = sum(col4, col5))

result1b <- df %>% 
  group_by(col1, col2) %>%
  summarize45()

identical(result1, result1b)
# [1] TRUE

Or, pushing it further:

gb_df <- partial(group_by, df)

result1c <- gb_df(col1, col2) %>% summarize45()

identical(result1, result1c)
# [1] TRUE
moodymudskipper
  • This is awesome, @Moody_Mudskipper. Exactly what I need: it immediately reduced my code to 1/6, and I'm not even halfway through the work. I would give you more than +1 if I could. – huan May 02 '19 at 17:58