When I use group_by and summarise in dplyr, I can naturally apply different summary functions to different variables. For instance:

    library(tidyverse)

    df <- tribble(
      ~category,   ~x,  ~y,  ~z,
      #----------------------
          'a',      4,   6,   8,
          'a',      7,   3,   0,
          'a',      7,   9,   0,
          'b',      2,   8,   8,
          'b',      5,   1,   8,
          'b',      8,   0,   1,
          'c',      2,   1,   1,
          'c',      3,   8,   0,
          'c',      1,   9,   1
     )

    df %>% group_by(category) %>% summarize(
      x=mean(x),
      y=median(y),
      z=first(z)
    )

results in output:

    # A tibble: 3 x 4
      category     x     y     z
         <chr> <dbl> <dbl> <dbl>
    1        a     6     6     8
    2        b     5     1     8
    3        c     2     8     1

My question is, how would I do this with summarise_at? Obviously for this example it's unnecessary, but assume I have lots of variables that I want to take the mean of, lots of medians, etc.

Do I lose this functionality once I move to summarise_at? Do I have to use all functions on all groups of variables and then throw away the ones I don't want?

Perhaps I'm just missing something, but I can't figure it out, and I don't see any examples of this in the documentation. Any help is appreciated.

David Pepper
  • The base `Map` functionality can do this, `Map(function(f,v) f(v), c(mean,median,first), df[c("x","y","z")])` for instance. Maybe `purrr`'s `map` could do something similar? – thelatemail Sep 13 '17 at 03:51
  • Yes, I was wondering if purrr could offer us a way out of this. It's worth investigating. But in your example aren't you just applying all functions to all variables? And how would you use this with group_by? – David Pepper Sep 13 '17 at 04:09
  • Nope, I'm applying each function in turn to each variable with `Map` - see the results of `mean(df$x); median(df$y); first(df$z)` and compare to the `Map` code. – thelatemail Sep 13 '17 at 04:24
  • OK, I see what you mean, but my question here is the same as to ycw: what if I have three variables for the first function, ten for the second and one for the third? And this looks like a substitute for summarise_at rather than something to put inside it. I guess I'm asking for the complete code, because when I apply your suggestion to my sample data frame I don't get the answer I'm looking for. – David Pepper Sep 13 '17 at 04:39

2 Answers

Here is one idea.

    library(tidyverse)

    # Summarize each set of columns with its own function
    df_mean <- df %>%
      group_by(category) %>%
      summarize_at(vars(x), mean)

    df_median <- df %>%
      group_by(category) %>%
      summarize_at(vars(y), median)

    df_first <- df %>%
      group_by(category) %>%
      summarize_at(vars(z), first)

    # Join the per-function summaries on the grouping column
    df_summary <- reduce(list(df_mean, df_median, df_first),
                         left_join, by = "category")

As you said, there is no need to use summarise_at for this example. However, if you have many columns that need to be summarized by different functions, this strategy may work. You will need to specify the columns in vars(...) for each summarize_at call. The selection rules are the same as for dplyr::select.
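
As a rough sketch of how this strategy extends to several columns per function, the snippet below applies mean to both x and y and median to z, then joins the two results. The data frame is rebuilt here only so the snippet runs on its own, and the object names (`grouped`, `means_xy`, `median_z`, `result_join`) are placeholders, not anything from the question.

```r
library(dplyr)

# Same data as in the question, rebuilt so this snippet is self-contained
df <- tibble(
  category = rep(c("a", "b", "c"), each = 3),
  x = c(4, 7, 7, 2, 5, 8, 2, 3, 1),
  y = c(6, 3, 9, 8, 1, 0, 1, 8, 9),
  z = c(8, 0, 0, 8, 8, 1, 1, 0, 1)
)

grouped <- df %>% group_by(category)

# One summarise_at call per function; vars() accepts select-style column specs
means_xy <- grouped %>% summarise_at(vars(x, y), mean)
median_z <- grouped %>% summarise_at(vars(z), median)

# Combine the per-function summaries on the grouping column
result_join <- left_join(means_xy, median_z, by = "category")
result_join
```

Each additional function costs one more summarise_at call and one more table in the join, so this stays manageable for a handful of functions.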

Update

Here is another idea. Define a wrapper around summarise_at, then use map2 to apply it over a look-up list that maps each function name to the variables it should summarize. In this example, mean is applied to the x and y columns and median to z.

    # Define a function: summarize the given columns with the named function
    summarise_at_fun <- function(variable, func, data){
      # variable: character vector of column names; func: function name as a string
      data2 <- data %>%
        summarise_at(variable, match.fun(func))
      return(data2)
    }

    # Group the data
    df2 <- df %>% group_by(category)

    # Create a look-up list with function names and variables to apply them to
    look_list <- list(mean = c("x", "y"),
                      median = "z")

    # Apply summarise_at_fun to each (variables, function name) pair, then join
    map2(look_list, names(look_list), summarise_at_fun, data = df2) %>%
      reduce(left_join, by = "category")

    # A tibble: 3 x 4
      category     x     y     z
         <chr> <dbl> <dbl> <dbl>
    1        a     6     6     0
    2        b     5     3     8
    3        c     2     6     1
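
For readers on a newer dplyr, the same result can probably be had in a single pipe. This is a minimal sketch that assumes dplyr 1.0.0 or later, where `across()` is available; it was not an option when this answer was written. The data frame is rebuilt so the snippet is self-contained, and `result_across` is just an illustrative name.

```r
library(dplyr)

# Same data as in the question, rebuilt so this snippet is self-contained
df <- tibble(
  category = rep(c("a", "b", "c"), each = 3),
  x = c(4, 7, 7, 2, 5, 8, 2, 3, 1),
  y = c(6, 3, 9, 8, 1, 0, 1, 8, 9),
  z = c(8, 0, 0, 8, 8, 1, 1, 0, 1)
)

# Assumes dplyr >= 1.0.0: each across() call pairs a column set with a function
result_across <- df %>%
  group_by(category) %>%
  summarise(
    across(c(x, y), mean),   # mean of x and y
    across(z, median)        # median of z
  )
result_across
```

Each `across()` call can take any number of columns, so the column sets do not need to be the same length. Output column names can be adjusted through the `.names` argument of `across()`.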
www
  • This is indeed possible, and more elegant than the various "long" solutions that I had considered. But wouldn't it be nice to do it in one command? Also, is there any way to control the names of the resulting columns when using summarise_at? – David Pepper Sep 13 '17 at 03:50
  • @DavidEpstein It is possible to assign name using `summarise_at`. You can do `funs(x = mean(.))`, which leads to `Col_x` where `Col` is the original column name. – www Sep 13 '17 at 03:54
  • @DavidEpstein As for your first question, I am not sure whether it is possible. I developed this answer earlier: https://stackoverflow.com/questions/45801972/find-average-by-group-over-a-time-period-and-retrieve-last-date-for-same-period/45802176#45802176 to apply different functions based on different conditions. However, since you did not specify any condition on the columns you want to test, I do not know how to implement a similar approach. – www Sep 13 '17 at 03:57
  • Thanks for the links, but I still don't see anything there about applying one function to one subset of variables and another function to another subset. – David Pepper Sep 13 '17 at 04:06
  • @DavidEpstein Please see my update. This is probably more relevant to what you want. You need to create a new function and create a look-up table to show the relationship between variable names and functions to apply. – www Sep 13 '17 at 04:21
  • I like your update too, but it's my understanding that for map2, the lengths of the x and y variables have to be the same. In the example this is true, of course, but more generally any number of variables might be summarized by each function. Would your method work if, say, a list of variables were passed as the first element to map2? – David Pepper Sep 13 '17 at 04:35
  • @DavidEpstein Please see my updates again. What we need may not be a look-up table but a look-up "list". That way, we can store any number of column names under one function. – www Sep 13 '17 at 05:05
  • Out of interest, here's the data.table version - `df[ , unlist(Map( function(data,vars,fun) lapply(data[, vars, with=FALSE], fun), .(.SD), .(c("x","y"), c("z")), c(mean, median)), recursive=FALSE ), by=category ]` – thelatemail Sep 13 '17 at 05:13
  • @thelatemail Thanks for sharing. Would you like to post your solution as an answer? I think at least some people may be interested. – www Sep 13 '17 at 05:56
  • This is a clever solution also. Perhaps I wasn't clear as to the intention of my initial post. I'd rather do this operation via a tidy pipe, and if it isn't possible then I consider this something of a bug, or at least a shortcoming in the current dplyr package that could usefully be addressed. – David Pepper Sep 13 '17 at 19:42

Since your question is about summarise_at, here is my idea:

    # A named list of functions is used here; funs() is deprecated as of dplyr 0.8.0
    df %>%
      group_by(category) %>%
      summarise_at(vars(x, y, z),
                   list(mean = mean, sd = sd, min = min),
                   na.rm = TRUE)
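
Note that a call of this shape applies every function to every selected column, producing columns named like `x_mean` and `y_sd`. As a hedged sketch (assuming dplyr 0.8.0 or later, where a named list of functions is accepted), one way to keep only the combinations of interest is to `select()` them afterwards. The data is rebuilt so the snippet runs on its own, and `all_stats` and `wanted` are illustrative names.

```r
library(dplyr)

# Same data as in the question, rebuilt so this snippet is self-contained
df <- tibble(
  category = rep(c("a", "b", "c"), each = 3),
  x = c(4, 7, 7, 2, 5, 8, 2, 3, 1),
  y = c(6, 3, 9, 8, 1, 0, 1, 8, 9),
  z = c(8, 0, 0, 8, 8, 1, 1, 0, 1)
)

# Every function is applied to every column: 3 columns x 3 functions
all_stats <- df %>%
  group_by(category) %>%
  summarise_at(vars(x, y, z),
               list(mean = mean, sd = sd, min = min),
               na.rm = TRUE)

# Keep only the column/function pairs that are actually wanted
wanted <- all_stats %>% select(category, x_mean, y_sd, z_min)
wanted
```

This does compute summaries that are then discarded, which is the trade-off the question raises.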
Suraj Rao
dido