1

I am trying to find lazy/easy ways of creating summary tables/data.frames from wide data.frames. Assume a following data.frame, but with many more columns so that specifying the column names takes a long time:

set.seed(2)
x <- data.frame(Rep = rep(1:3, 4), Temp = c(rep(10,6), rep(20,6)), 
pH = rep(c(rep(8.1, 3), rep(7.6, 3)), 2),
Var1 = rnorm(12, 5,2), Var2 = c(rnorm(6,4,1), rnorm(6,3,5)),
Var3 = rt(12, 20))
x[1:3] <- as.data.frame(apply(x[1:3], 2, function(x) as.factor(x)))

Now I can calculate summary statistics with plyr:

(mu <- ddply(x, .(Temp, pH), numcolwise(mean)))
(std <- ddply(x, .(Temp, pH), numcolwise(sd)))
(n  <- ddply(x, .(Temp, pH), numcolwise(length)))

But I am not able to figure out how to apply all of these functions at the same time:

ddply(x, .(Temp, pH), numcolwise(mean, sd, length))

I could of course merge various summary data.tables, but this would not be a "lazy / easy" way of doing this. I am looking for something general that I could apply in many instances. Something like this with the exception that it should be possible to generate with a single function:

xx <- merge(mu, std, by = c("Temp", "pH"), sotr = F)
colnames(xx) <- gsub("x", "mean", colnames(xx))
colnames(xx) <- gsub("y", "sd", colnames(xx))
xx <- merge(xx, n, by = c("Temp", "pH"), sotr = F)
colnames(xx)[(ncol(xx)-2):ncol(xx)] <-
paste0(colnames(xx)[(ncol(xx)-2):ncol(xx)], ".length")
xx <- xx[c("Temp", "pH", grep("Var1", colnames(xx), value = T),
grep("Var2", colnames(xx), value = T),
grep("Var3", colnames(xx), value = T))]
xx

  Temp  pH Var1.mean  Var1.sd Var1.length Var2.mean  Var2.sd Var2.length Var3.mean  Var3.sd Var3.length
1   10 7.6  4.281195 1.352194           3  3.534447 1.652884           3 0.1529616 1.076276           3
2   10 8.1  5.583853 2.491672           3  4.116622 1.478286           3 1.1611944 1.081301           3
3   20 7.6  5.840411 1.120549           3  6.907273 8.628021           3 0.1301949 1.764201           3
4   20 8.1  6.635154 2.232262           3  8.893188 4.208087           3 0.5509202 1.187431           3

Is this possible to do in R currently? Any advice would be greatly appreciated.

Brian Tompsett - 汤莱恩
  • 5,753
  • 72
  • 57
  • 129
Mikko
  • 7,530
  • 8
  • 55
  • 92

2 Answers2

5

One way to do it with reshape2 and plyr. But you get results with variables in rows instead of columns :

library(reshape2)
library(plyr)
md <- melt(x[,-1], id.vars=c("Temp","pH"))
ddply(md, c("Temp", "pH", "variable"), summarize, mean=mean(value), sd=sd(value))

Which gives :

   Temp  pH variable      mean       sd
1    10 7.6     Var1 4.2811952 1.352194
2    10 7.6     Var2 3.5344474 1.652884
3    10 7.6     Var3 0.1529616 1.076276
4    10 8.1     Var1 5.5838533 2.491672
5    10 8.1     Var2 4.1166215 1.478286
6    10 8.1     Var3 1.1611944 1.081301
7    20 7.6     Var1 5.8404110 1.120549
8    20 7.6     Var2 6.9072734 8.628021
9    20 7.6     Var3 0.1301949 1.764201
10   20 8.1     Var1 6.6351538 2.232262
11   20 8.1     Var2 8.8931884 4.208087
12   20 8.1     Var3 0.5509202 1.187431

If you want your results in a wide format, you can use reshape :

md <- melt(x[,-1], id.vars=c("Temp","pH"))
result <- ddply(md, c("Temp", "pH", "variable"), summarize, mean=mean(value), sd=sd(value))
reshape(result, idvar=c("Temp","pH"), timevar="variable",direction="wide")

   Temp  pH mean.Var1  sd.Var1 mean.Var2  sd.Var2 mean.Var3  sd.Var3
1    10 7.6  4.281195 1.352194  3.534447 1.652884 0.1529616 1.076276
4    10 8.1  5.583853 2.491672  4.116622 1.478286 1.1611944 1.081301
7    20 7.6  5.840411 1.120549  6.907273 8.628021 0.1301949 1.764201
10   20 8.1  6.635154 2.232262  8.893188 4.208087 0.5509202 1.187431
juba
  • 47,631
  • 14
  • 113
  • 118
2

Base R's aggregate can actually handle this, but in a strange way:

(temp <- aggregate(. ~ Temp + pH, x, function(y) cbind(mean(y), sd(y), length(y))))
#   Temp  pH Rep.1 Rep.2 Rep.3   Var1.1   Var1.2   Var1.3   Var2.1   Var2.2   Var2.3
# 1   10 7.6     2     1     3 4.281195 1.352194 3.000000 3.534447 1.652884 3.000000
# 2   20 7.6     2     1     3 5.840411 1.120549 3.000000 6.907273 8.628021 3.000000
# 3   10 8.1     2     1     3 5.583853 2.491672 3.000000 4.116622 1.478286 3.000000
# 4   20 8.1     2     1     3 6.635154 2.232262 3.000000 8.893188 4.208087 3.000000
#      Var3.1    Var3.2    Var3.3
# 1 0.1529616 1.0762763 3.0000000
# 2 0.1301949 1.7642008 3.0000000
# 3 1.1611944 1.0813007 3.0000000
# 4 0.5509202 1.1874306 3.0000000
str(temp)
# 'data.frame':  4 obs. of  6 variables:
#  $ Temp: Factor w/ 2 levels "10","20": 1 2 1 2
#  $ pH  : Factor w/ 2 levels "7.6","8.1": 1 1 2 2
#  $ Rep : num [1:4, 1:3] 2 2 2 2 1 1 1 1 3 3 ...
#  $ Var1: num [1:4, 1:3] 4.28 5.84 5.58 6.64 1.35 ...
#  $ Var2: num [1:4, 1:3] 3.53 6.91 4.12 8.89 1.65 ...
#  $ Var3: num [1:4, 1:3] 0.153 0.13 1.161 0.551 1.076 ...

Notice that when we look at the structure of the output, we find that "Rep", "Var1", and so on are actually matrices. So, you can extract them and cbind them. But, that's somewhat tedious.

I had to do something similar once a while back, and I ended up just writing a basic wrapper around aggregate that looks like this.

aggregate2 <- function(data, aggs, ids, funs = NULL, ...) {
  if (identical(aggs, "."))
    aggs <- setdiff(names(data), ids)
  if (identical(ids, "."))
    ids <- setdiff(names(data), aggs)
  if (is.null(funs))
    stop("Aggregation function missing")
  myformula <- as.formula(
    paste(sprintf("cbind(%s)", paste(aggs, collapse = ", ")),
          " ~ ", paste(ids, collapse = " + ")))
  temp <- aggregate(
    formula = eval(myformula), data = data,
    FUN = function(x) sapply(seq_along(funs), 
                             function(z) eval(call(funs[z], quote(x)))), ...)
  temp1 <- do.call(cbind, lapply(temp[-c(1:length(ids))], as.data.frame))
  names(temp1) <- paste(rep(aggs, each = length(funs)), funs, sep = ".")
  cbind(temp[1:length(ids)], temp1)
}

Here's how you apply it to your example data.

(temp2 <- aggregate2(x, ".", c("Temp", "pH"), c("mean", "sd", "length")))
#   Temp  pH Rep.mean Rep.sd Rep.length Var1.mean  Var1.sd Var1.length Var2.mean
# 1   10 7.6        2      1          3  4.281195 1.352194           3  3.534447
# 2   20 7.6        2      1          3  5.840411 1.120549           3  6.907273
# 3   10 8.1        2      1          3  5.583853 2.491672           3  4.116622
# 4   20 8.1        2      1          3  6.635154 2.232262           3  8.893188
#    Var2.sd Var2.length Var3.mean  Var3.sd Var3.length
# 1 1.652884           3 0.1529616 1.076276           3
# 2 8.628021           3 0.1301949 1.764201           3
# 3 1.478286           3 1.1611944 1.081301           3
# 4 4.208087           3 0.5509202 1.187431           3

And, the structure is what we expect.

str(temp2)
# 'data.frame':  4 obs. of  14 variables:
#  $ Temp       : Factor w/ 2 levels "10","20": 1 2 1 2
#  $ pH         : Factor w/ 2 levels "7.6","8.1": 1 1 2 2
#  $ Rep.mean   : num  2 2 2 2
#  $ Rep.sd     : num  1 1 1 1
#  $ Rep.length : num  3 3 3 3
#  $ Var1.mean  : num  4.28 5.84 5.58 6.64
#  $ Var1.sd    : num  1.35 1.12 2.49 2.23
#  $ Var1.length: num  3 3 3 3
#  $ Var2.mean  : num  3.53 6.91 4.12 8.89
#  $ Var2.sd    : num  1.65 8.63 1.48 4.21
#  $ Var2.length: num  3 3 3 3
#  $ Var3.mean  : num  0.153 0.13 1.161 0.551
#  $ Var3.sd    : num  1.08 1.76 1.08 1.19
#  $ Var3.length: num  3 3 3 3

If you don't want to use the function, this is the part that specifically deals with working with the output of aggregate, as applied to the "temp" object we created at the start of this answer:

temp1 <- do.call(cbind, lapply(temp[-c(1:2)], as.data.frame))
funs <- c("mean", "sd", "length")
names(temp1) <- paste(rep(setdiff(names(temp), c("pH", "Temp")), 
                          each = length(funs)), funs, sep = ".")
cbind(temp[1:2], temp1)

Update: A more simple solution

It turns out that you can actually just do:

do.call(data.frame, 
        aggregate(. ~ Temp + pH, x, function(y) cbind(mean(y), sd(y), length(y))))

The downside here is that the names are less descriptive than the aggregate2 function I shared, but that can be fixed with a pretty straightforward call to names.

Community
  • 1
  • 1
A5C1D2H2I1M1N2O1R2T1
  • 190,393
  • 28
  • 405
  • 485
  • Both of these answers give a solution for the problem. Juba's solution takes 3 rows, while yours only one. I think both fulfill the gruesome requirements of "lazy/easy" =). Both of them would deserve to be the accepted answer. I'll let people vote for the best solution and tick that one later. – Mikko Feb 07 '13 at 11:37
  • @Largh, just an FYI: not sure how big your actual data is. On this example and a few similar ones, sticking to `aggregate2` (which is really just `aggregate` in slight disguise) is about 3 times faster than a "plyr" + `reshape` approach ;) – A5C1D2H2I1M1N2O1R2T1 Feb 07 '13 at 12:19
  • 1
    @Largh, I [just found](http://stackoverflow.com/a/14759569/1270695) an even better solution: `do.call(data.frame, aggregate(. ~ Temp + pH, x, function(y) cbind(mean(y), sd(y), length(y))))` – A5C1D2H2I1M1N2O1R2T1 Feb 08 '13 at 05:39
  • To Ananda: could you edit it in you answer, please. I marked your answer as accepted. Thank you! – Mikko Feb 08 '13 at 07:37
  • @AnandaMahto how would I add in an na.action to your code? both the do.call and also your aggregate2 function. It is so close to what I need but I have Na's in my data – Salmo salar Jul 06 '14 at 13:00
  • 1
    @Salmosalar, if you're using the `do.call` version, just use `na.action` the same way you would normally use it with `aggregate`. I would recommend that over `aggregate2`. – A5C1D2H2I1M1N2O1R2T1 Jul 06 '14 at 13:29
  • Couldnt get it to work completely, however i adapted @juba answer perfectly. thanks! – Salmo salar Jul 16 '14 at 07:53