This is a simple question, and I am sure it is easily solvable with tapply, apply, by, or something similar. However, I am still relatively new to this, and I would like to ask for advice.

The problem:

I have a data frame with, say, 5 columns, of which columns 4 and 5 are factors. For each level of the factor in column 5, I want to apply a function to columns 1:3. This is, in principle, easily doable. However, I want the output as a nice table, and I want to learn how to do this in an elegant way, which is why I am asking here.

Example:

 df <- data.frame(x1=1:6, x2=12:17, x3=3:8, y=1:2, f=1:3)

Now, the command

 by(df[,1:3], df$y, sum)

would give me the sum for each factor level in y, which is almost what I want. Two additional steps are needed. The first is to do this for each factor level in f as well, which is almost trivial: I could wrap lapply around the above command. The second is the part I am stuck on: I want to generate a table with the results, and maybe even use it to generate a heatmap.
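For reference, here is a minimal base-R sketch of what the `by` call above returns on the example data, next to a per-column variant. The names `totals` and `per_col` are made up for illustration, and the `split`/`colSums` combination is only one of several ways to get per-column sums.

```r
df <- data.frame(x1 = 1:6, x2 = 12:17, x3 = 3:8, y = 1:2, f = 1:3)

# by() hands each sub-data-frame to sum(), so this yields one grand total
# over x1, x2 and x3 together for each level of y
totals <- by(df[, 1:3], df$y, sum)

# Per-column sums for each level of y: split the x columns by y,
# then apply colSums to every piece (rows: x1..x3, columns: levels of y)
per_col <- sapply(split(df[, 1:3], df$y), colSums)

print(totals)
print(per_col)
```

Since `per_col` is already a plain numeric matrix, it is closer to the table-like output asked for than the `by` object.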

Hence: is there an easy and more elegant way to do this and to generate a matrix with the corresponding output? This seems like an everyday task for data scientists, which is why I suspect that there is an existing built-in solution.

Thanks for any help or any hint, no matter how small!

coffeinjunky
1 Answer

You can use the reshape2 and plyr packages to accomplish this.

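library(plyr)
# Split df by every (y, f) combination and sum each piece; the result is stored in a column named V1.
# Note that sum() receives the whole piece, so the y and f values are included in each total.
df2 <- ddply(df, .(y, f), sum)

and then to turn it into an f-by-y matrix:

library(reshape2)
# Rows are the levels of f, columns are the levels of y, cells hold the V1 totals
acast(df2, f ~ y, value.var = "V1")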
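Because `sum` in the `ddply` call is applied to every column of each piece, the grouping values `y` and `f` end up inside the totals. If only `x1`, `x2` and `x3` should be summed, a base-R sketch along the following lines may help. It swaps `ddply`/`acast` for `rowSums` and `tapply`, and the names `row_total` and `m` are made up for illustration.

```r
df <- data.frame(x1 = 1:6, x2 = 12:17, x3 = 3:8, y = 1:2, f = 1:3)

# Total of x1 + x2 + x3 for every row, leaving y and f out of the sum
row_total <- rowSums(df[, c("x1", "x2", "x3")])

# Sum the row totals within each (f, y) cell;
# rows are the levels of f, columns are the levels of y
m <- tapply(row_total, list(f = df$f, y = df$y), sum)

print(m)
```

The result is an ordinary numeric matrix with `f` in the rows and `y` in the columns, the same layout as the `acast` output.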
mengeln
  • Do you really want to sum the `y` and `f` values as well? – thelatemail Aug 21 '13 at 01:23
  • Thanks for the solution! I have not yet fully understood it, since I have never worked with `plyr` before, but it seems promising at least. – coffeinjunky Aug 21 '13 at 10:40
  • @thelatemail Think of `f` as city, and `y` as year. For each year, I want to have each sum of `x_i` in each city. Think of `x1` as number of car accidents, `x2` as bike accidents, etc. This means the factors themselves are meaningless, and I just want the number of accidents for each type for each city. I should probably have specified this in my question to make the problem easier to understand. Sorry about this. – coffeinjunky Aug 21 '13 at 10:46
  • @user2378649 - in that case, `aggregate` should do it: `aggregate(. ~ y + f, data=df, sum)` or `aggregate(cbind(x1,x2,x3) ~ y + f, data=df, sum)` to explicitly specify the `xN` columns. – thelatemail Aug 21 '13 at 10:59
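
To tie the `aggregate` suggestion back to the matrix asked for in the question, here is a hedged sketch that runs the second `aggregate` call on the example data and then reshapes one of the summed columns with `xtabs`. The names `agg` and `m_x1` are made up for illustration, and `xtabs` is only one possible way to do the reshaping.

```r
df <- data.frame(x1 = 1:6, x2 = 12:17, x3 = 3:8, y = 1:2, f = 1:3)

# One row per (y, f) combination, with separate sums for x1, x2 and x3
agg <- aggregate(cbind(x1, x2, x3) ~ y + f, data = df, FUN = sum)

# Cross-tabulate a single summed column into an f-by-y table
# (rows: levels of f, columns: levels of y)
m_x1 <- xtabs(x1 ~ f + y, data = agg)

print(agg)
print(m_x1)
```

Repeating the `xtabs` step for `x2` and `x3` gives one table per accident type, and each table can then be handed to a plotting function such as `heatmap`.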