5

I am trying to apply multiple functions to multiple columns by a grouping variable. I can get the results, but not in a useful format. Below, I would like res2 to be an extension of res1 by the by variable "cyl" and as many rows as unique values of cyl.

I have tried omitting unlist and redefining the my.sum.function to return a numeric rather than a list. But I can't get the format I need.

library(data.table)

## The well known data 
data(mtcars)
DT <- data.table(mtcars)

## a custom set of summary functions
my.sum.fun = function(x){list(
    mean   = mean(x, na.rm=T),
    median = median(x, na.rm=T),
    sd     = sd(x, na.rm=T)
    )}

## I can summarize multiple columns. This works
res1 <- DT[,unlist(lapply(.SD,my.sum.fun)),.SDcols=c("mpg","hp")]
res1
 mpg.mean mpg.median     mpg.sd    hp.mean  hp.median      hp.sd 
 20.090625  19.200000   6.026948 146.687500 123.000000  68.562868 

## Now I add a by column. What I would like is the format as res1 but with the by column "cyl" added and with as many rows as unique values of "cyl".
res2 <- DT[,unlist(lapply(.SD,my.sum.fun)),.SDcols=c("mpg","hp"),by=list(cyl)]
res2
    cyl         V1
 1:   6  19.742857
 2:   6  19.700000
 3:   6   1.453567
 4:   6 122.285714
 5:   6 110.000000
 6:   6  24.260491
 7:   4  26.663636
 8:   4  26.000000
 9:   4   4.509828
10:   4  82.636364
11:   4  91.000000
12:   4  20.934530
13:   8  15.100000
14:   8  15.200000
15:   8   2.560048
16:   8 209.214286
17:   8 192.500000
18:   8  50.976886
John Doe
  • 112
  • 6

3 Answers3

3

There is an option in unlist to avoid unlisting recursively - the recursive parameter (By default, the recursive = TRUE)

DT[,unlist(lapply(.SD,my.sum.fun), 
      recursive = FALSE),.SDcols=c("mpg","hp"),by=list(cyl)]
#   cyl mpg.mean mpg.median   mpg.sd   hp.mean hp.median    hp.sd
#1:   6 19.74286       19.7 1.453567 122.28571     110.0 24.26049
#2:   4 26.66364       26.0 4.509828  82.63636      91.0 20.93453
#3:   8 15.10000       15.2 2.560048 209.21429     192.5 50.97689
akrun
  • 874,273
  • 37
  • 540
  • 662
1

I realize it may seem a little silly to use dplyr within data.table, but I don't think summarize_all is any slower than lapply and this will still let you take advantage of data table's fast grouping, etc.

library(dplyr)

my_funs <- list(
    mean   = function(x) mean(x, na.rm=T),
    median = function(x) median(x, na.rm=T),
    sd     = function(x) sd(x, na.rm=T)
  )

dt[, summarise_all(.SD, my_funs), .SDcols = c("mpg", "hp"), by = 'cyl']

#    cyl mpg_mean   hp_mean mpg_median hp_median   mpg_sd    hp_sd
# 1:   6 19.74286 122.28571       19.7     110.0 1.453567 24.26049
# 2:   4 26.66364  82.63636       26.0      91.0 4.509828 20.93453
# 3:   8 15.10000 209.21429       15.2     192.5 2.560048 50.97689
IceCreamToucan
  • 28,083
  • 2
  • 22
  • 38
1

Alternatively, you can use mapply. This has the added benefit of being applicable without a change in syntax with or without by.

> DT[, mapply(my.sum.fun, .SD), .SDcols=c("mpg","hp"), by=list(cyl)]
   cyl       V1   V2       V3        V4    V5       V6
1:   6 19.74286 19.7 1.453567 122.28571 110.0 24.26049
2:   4 26.66364 26.0 4.509828  82.63636  91.0 20.93453
3:   8 15.10000 15.2 2.560048 209.21429 192.5 50.97689

You may also be interested in SIMPLIFY = FALSE, which would return the data.table in a long form and preserve column names -

DT[, mapply(my.sum.fun, .SD, SIMPLIFY = FALSE), .SDcols=c("mpg","hp"), by=list(cyl)]
   cyl      mpg       hp
1:   6 19.74286 122.2857
2:   6     19.7      110
3:   6 1.453567 24.26049
4:   4 26.66364 82.63636
5:   4       26       91
6:   4 4.509828 20.93453
7:   8     15.1 209.2143
8:   8     15.2    192.5
9:   8 2.560048 50.97689
Ameya
  • 1,712
  • 1
  • 14
  • 29