Replace multiple `summarize` statements with a function

Question

I'm currently repeating a lot of code because I always need to summarize the same columns for different groups. How can I do this efficiently by writing the summarize call (which is always the same) only once, while defining the output name and the group_by arguments case by case?

A minimal example:

library(dplyr)

col1 <- c("UK", "US", "UK", "US")
col2 <- c("Tech", "Social", "Social", "Tech")
col3 <- c("0-5years", "6-10years", "0-5years", "0-5years")
col4 <- 1:4
col5 <- 5:8

df <- data.frame(col1, col2, col3, col4, col5)

result1 <- df %>% 
  group_by(col1, col2) %>% 
  summarize(sum1 = sum(col4, col5))

result2 <- df %>% 
  group_by(col2, col3) %>% 
  summarize(sum1 = sum(col4, col5))

result3 <- df %>% 
  group_by(col1, col3) %>% 
  summarize(sum1 = sum(col4, col5))

The `ddply` function (from plyr) is more succinct than `group_by %>% summarise`. You can rewrite the first one as `ddply(df, .(col1, col2), summarise, sum1 = sum(col4, col5))`. It doesn't answer your actual question, but it will cut down the number of lines you use. – morgan121, Apr 29 '19 at 12:08

score 5 · Answer 1 · answered Apr 29 '19 at 12:20

Using combn:

combn(colnames(df)[1:3], 2, FUN = function(x){
  df %>% 
    group_by(.dots = x) %>% 
    summarize(sum1 = sum(col4, col5))
  }, simplify = FALSE)
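On dplyr 1.0.0 or later, where the `.dots` argument of `group_by()` is deprecated, the same idea can be sketched with `across(all_of(...))`. The names `group_pairs` and `results` are introduced here for illustration only, and naming the list elements after their grouping columns is an assumption about what is convenient, not part of the original answer.

```r
library(dplyr)

# Example data from the question
df <- data.frame(
  col1 = c("UK", "US", "UK", "US"),
  col2 = c("Tech", "Social", "Social", "Tech"),
  col3 = c("0-5years", "6-10years", "0-5years", "0-5years"),
  col4 = 1:4,
  col5 = 5:8
)

# All pairs of grouping columns, kept as a list of character vectors
group_pairs <- combn(c("col1", "col2", "col3"), 2, simplify = FALSE)

# Name each pair, e.g. "col1_col2", so results can be looked up by name
names(group_pairs) <- vapply(group_pairs, paste, character(1), collapse = "_")

# Apply the same summary to every pair of grouping columns
results <- lapply(group_pairs, function(vars) {
  df %>%
    group_by(across(all_of(vars))) %>%
    summarize(sum1 = sum(col4, col5), .groups = "drop")
})

results$col1_col2
```

Keeping the results in one named list avoids creating `result1`, `result2`, ... by hand; a single element can still be pulled out with `results$col1_col3` and so on.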

zx8754

score 2 · Answer 2 · answered Apr 29 '19 at 12:36

2

To use dplyr verbs inside your own functions, you can use tidy evaluation. This is needed because dplyr uses non-standard evaluation: column names passed to its verbs are captured as expressions and looked up in the data frame instead of being evaluated like ordinary R code. I recommend reading this:

https://tidyeval.tidyverse.org/modifying-inputs.html#modifying-quoted-expressions

summarizefunction <- function(data, ..., sumvar1, sumvar2) {

    groups <- enquos(...)
    sumvar1 <- enquo(sumvar1)
    sumvar2 <- enquo(sumvar2)

    result <- data %>%
        group_by(!!!groups) %>%
        summarise(sum1 = sum(!!sumvar1, !!sumvar2))
    return(result)
}

summarizefunction(df, col1, col2, sumvar1 = col4, sumvar2 = col5)

The `enquo()` function captures (quotes) an argument, which prevents it from being evaluated immediately; `enquos()` does the same for `...`. You can then use the `!!` operator (called bang-bang) to unquote a single captured argument, and `!!!` to splice a list of them. I think this is the most flexible and reusable solution, even though you have to write a bit more code up front.
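As a hedged aside, newer versions of rlang (0.4.0 and later) offer the embrace operator `{{ }}`, which combines the quote and unquote steps described above. The sketch below assumes dplyr 1.0.0 or later (for `.groups`); `summarize_embraced` is a hypothetical name, not the answer's function.

```r
library(dplyr)

# Example data from the question
df <- data.frame(
  col1 = c("UK", "US", "UK", "US"),
  col2 = c("Tech", "Social", "Social", "Tech"),
  col3 = c("0-5years", "6-10years", "0-5years", "0-5years"),
  col4 = 1:4,
  col5 = 5:8
)

# Hypothetical variant of summarizefunction(): the grouping columns are
# forwarded through `...`, and {{ }} quotes and unquotes each sum variable
summarize_embraced <- function(data, ..., sumvar1, sumvar2) {
  data %>%
    group_by(...) %>%
    summarise(sum1 = sum({{ sumvar1 }}, {{ sumvar2 }}), .groups = "drop")
}

res_embraced <- summarize_embraced(df, col1, col2, sumvar1 = col4, sumvar2 = col5)
res_embraced
```

The call site is unchanged compared with `summarizefunction()`; only the body is shorter because no explicit `enquo()` / `!!` pairs are needed.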

DSGym

This approach seems to fit my case best. Just one question: I have hundreds of different variables to sum, divide, etc. Is there a way to avoid typing or copying all of them into the `function()` part? – huan Apr 29 '19 at 13:10
My number of group combinations is much smaller (8). – huan Apr 29 '19 at 13:19

I would recommend having a look at the reshape2 package, which lets you restructure your dataset into a tidy long format, for example with `reshape2::melt(df)`. Then consider the `split` function, which converts your long df into a list of smaller dfs, and use `lapply` in combination with `summarizefunction`. Aggregating over multiple columns is almost always a "not so nice" idea. If you like my solution, please accept my answer :-) – DSGym Apr 29 '19 at 14:08
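For the "hundreds of variables" case raised in the comments, one possible sketch is to pass the column names as character vectors and let `across()` apply the summary to each of them. This is not the melt/split/lapply route suggested above but a swapped-in dplyr (1.0.0 or later) technique. `summarize_many`, `group_vars` and `sum_vars` are hypothetical names, and note that each column is summed separately here rather than pooled as in `sum(col4, col5)`.

```r
library(dplyr)

# Example data from the question
df <- data.frame(
  col1 = c("UK", "US", "UK", "US"),
  col2 = c("Tech", "Social", "Social", "Tech"),
  col3 = c("0-5years", "6-10years", "0-5years", "0-5years"),
  col4 = 1:4,
  col5 = 5:8
)

# Hypothetical helper: group by any set of columns and sum any set of
# columns, both given as character vectors (one output column per input)
summarize_many <- function(data, group_vars, sum_vars) {
  data %>%
    group_by(across(all_of(group_vars))) %>%
    summarise(across(all_of(sum_vars), sum, .names = "sum_{.col}"),
              .groups = "drop")
}

res_many <- summarize_many(df, c("col1", "col2"), c("col4", "col5"))
res_many
```

Because the columns are plain strings, a long list could be built programmatically, for example with `setdiff(names(df), group_vars)`, instead of being typed out in the function signature.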

nycrefugee · Answer 3 · 2019-04-29T12:33:00.297

First, wrap the summary in a function that converts the column names, passed as strings, into symbols:

library(tidyverse)
res_func <- function(x, y){
  df %>% 
  group_by(!!as.symbol(x), !!as.symbol(y)) %>% 
  summarize(sum1 = sum(col4, col5))
}

This works like a charm:

res_func("col1", "col2")

# A tibble: 4 x 3
# Groups:   col1 [2]
  col1  col2    sum1
  <fct> <fct>  <int>
1 UK    Social    10
2 UK    Tech       6
3 US    Social     8
4 US    Tech      12

We can use `assign` to write a function that names the resulting data frame after the parameters passed in:

res_func2 <- function(x, y){
  assign(paste0("result_", x, y),
         df %>% 
           group_by(!!as.symbol(x), !!as.symbol(y)) %>% 
           summarize(sum1 = sum(col4, col5)), 
         envir = parent.frame())
}

Running res_func2("col1", "col2") then creates a new df called result_col1col2 in the calling environment:

> result_col1col2
# A tibble: 4 x 3
# Groups:   col1 [2]
  col1  col2    sum1
  <fct> <fct>  <int>
1 UK    Social    10
2 UK    Tech       6
3 US    Social     8
4 US    Tech      12
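Since `assign()` creates variables as a side effect, a possible alternative is to return the results in a named list. The following is only a sketch under that assumption: `xs`, `ys` and `results` are hypothetical names, and `res_func` is changed here to take the data frame as an argument instead of relying on a global `df`.

```r
library(dplyr)

# Example data from the question
df <- data.frame(
  col1 = c("UK", "US", "UK", "US"),
  col2 = c("Tech", "Social", "Social", "Tech"),
  col3 = c("0-5years", "6-10years", "0-5years", "0-5years"),
  col4 = 1:4,
  col5 = 5:8
)

# Same idea as res_func() above, but the data frame is passed in explicitly
res_func <- function(data, x, y) {
  data %>%
    group_by(!!as.symbol(x), !!as.symbol(y)) %>%
    summarize(sum1 = sum(col4, col5), .groups = "drop")
}

# Grouping pairs to run, given as two parallel character vectors
xs <- c("col1", "col2", "col1")
ys <- c("col2", "col3", "col3")

# One result per pair, stored under the name assign() would have used
results <- Map(function(x, y) res_func(df, x, y), xs, ys)
names(results) <- paste0("result_", xs, ys)

results$result_col1col2
```

The list keeps all summaries together, so they can be iterated over or saved in one step, and nothing is written into the calling environment behind the caller's back.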

It would look simpler if you passed one arg to the function: `res_func <- function(x){ df %>% group_by(!!as.symbol(x)) %>% summarize(sum1 = sum(col4, col5)) }` – zx8754, Apr 29 '19 at 12:32
Would that be easy to assign to the name of a new df, as requested? – nycrefugee, Apr 29 '19 at 12:35

score 1 · Accepted Answer · answered May 02 '19 at 09:44

You can also use `purrr::partial` in these situations:

library(purrr)
summarize45 <- partial(summarize, sum1 = sum(col4, col5))

result1b <- df %>% 
  group_by(col1, col2) %>%
  summarize45()

identical(result1, result1b)
# [1] TRUE

Or, pushing it further:

gb_df <- partial(group_by, df)

result1c <- gb_df(col1, col2) %>% summarize45()

identical(result1, result1c)
# [1] TRUE
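For readers who prefer not to depend on purrr, roughly the same effect can be sketched with an ordinary wrapper function that fixes the `sum1 = sum(col4, col5)` argument by hand. `summarize45_plain` is a hypothetical name, and `.groups = "drop"` assumes dplyr 1.0.0 or later.

```r
library(dplyr)

# Example data from the question
df <- data.frame(
  col1 = c("UK", "US", "UK", "US"),
  col2 = c("Tech", "Social", "Social", "Tech"),
  col3 = c("0-5years", "6-10years", "0-5years", "0-5years"),
  col4 = 1:4,
  col5 = 5:8
)

# Plain wrapper: the summary expression is written once, inside the function
summarize45_plain <- function(.data) {
  summarize(.data, sum1 = sum(col4, col5), .groups = "drop")
}

# Reference result written out in full, for comparison
result1_ref <- df %>%
  group_by(col1, col2) %>%
  summarize(sum1 = sum(col4, col5), .groups = "drop")

result1d <- df %>%
  group_by(col1, col2) %>%
  summarize45_plain()

all.equal(result1_ref, result1d)
```

The wrapper is slightly longer than the `partial()` call, but it makes explicit what is being fixed and leaves room for extra arguments later.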

moodymudskipper

This is awesome, @Moody_Mudskipper. Exactly what I need: it immediately reduced my code to 1/6, and I'm not even halfway done. I would give you more than +1 if I could. – huan May 02 '19 at 17:58
