plyr ddply and summarise use in R

Question

Hi, I want to avoid using loops, so I would like to use something from plyr to solve my problem.

I would like to create a function that returns the sum of a chosen column for each factor level in a data frame.

So if we have the following example data...

df <- data.frame(cbind(x=rnorm(100),y=rnorm(100),z=rnorm(100),f=sample(1:10,100, replace=TRUE))) 
df$f <- as.factor(df$f)

i.e. I would like something like:

foo <- function(df.obj,colname){
     some code
}

where the df.obj would be the df variable above and the colname argument could be any of x,y or z.

and I would like the output/result of the function to have a column of the unique factors (in the above case 1:10) and the sums of the values in column x for each factor.

I expect the solution to be quite simple, probably using ddply or summarise somehow, but I can't work out how to do it so that I can pass the column name as an argument.

Thanks

This appears a perfect fit for `data.table`. – mnel Sep 10 '12 at 02:55

ROLO · Accepted Answer · 2012-08-23T09:35:52.987

2

Is this what you're after?

> ddply(df, .(f), colwise(sum))
    f          x           y          z
1   1 -0.4190284  2.61101681  1.2280026
2   2  1.1063977  2.40006922  4.9550079
3   3  0.4498366 -4.00610558  0.9964754
4   4  1.9325488 -2.81241212 -3.1185574
5   5 -4.1077670 -1.01232884 -3.9852388
6   6 -1.0488003 -2.42924689  3.5273636
7   7  2.2999306  0.85930085 -0.6245167
8   8 -4.8105311 -6.81352238 -2.1223436
9   9 -2.8187083  5.03391770  1.6433896
10 10  5.1323666 -0.06192382  1.8978994

Edit: correct answer as supplied by TS:

foo <- function(df.obj,colname){ddply(df.obj, .(f), colwise(sum))[,c("f",colname)]}
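
As a possible variation on the function above, here is a minimal sketch that sums only the requested column inside `ddply` instead of summing every column and subsetting afterwards. It assumes plyr is installed; the `factorname` default and the `set.seed` call are additions for illustration and are not part of the original answer.

```r
library(plyr)

# Example data in the same shape as the question (seed added for reproducibility)
set.seed(1)
df <- data.frame(x = rnorm(100), y = rnorm(100), z = rnorm(100),
                 f = factor(sample(1:10, 100, replace = TRUE)))

# Sum one column per level of a grouping column.
# colname and factorname are both passed as character strings.
foo <- function(df.obj, colname, factorname = "f") {
  ddply(df.obj, factorname, function(d) {
    out <- data.frame(sum(d[[colname]]))  # one-row result for this group
    names(out) <- colname                 # keep the original column name
    out
  })
}

res <- foo(df, "x")
res
```

Passing the grouping column as a string also means the same function could be called with a different factor column, if the data frame has one.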

edited Aug 23 '12 at 09:35

answered Aug 23 '12 at 09:27


not quite... although close enough for me to figure it out from what you wrote... `foo <- function(df.obj,colname){ddply(df.obj, .(f), colwise(sum))[,c("f",colname)]}` I did not know about the colwise functions... thank you... – h.l.m Aug 23 '12 at 09:33
Out of curiosity, if there was a second or third factor column and I wanted to sum according to that factor instead of f, how would I go about doing that? I.e. I would like to add the factor column name as an argument to the `foo` function – h.l.m Aug 23 '12 at 09:57
This: `foo <- function(df.obj, factorname, colname){ddply(df.obj, factorname, colwise(sum,is.numeric))[,c(factorname,colname)]}` and call: `foo(df, "g", "y")`, where `g` is the second factor column – ROLO Aug 23 '12 at 10:06

score 1 · Answer 2 · answered Sep 10 '12 at 02:52

This seems a perfect fit for data.table, using the lapply(.SD, FUN) idiom together with the .SDcols argument.

.SD is a data.table containing the Subset of x's Data for each group, excluding the group column(s). Here x refers to the data.table itself, not to the column named x.
.SDcols is a vector containing the names of the columns to which you wish to apply the function (FUN).

An example

Set up the data.table

library(data.table)
DT <- as.data.table(df)

The sums of the x, y and z columns by f

DT[, lapply(.SD, sum), by = f, .SDcols = c("x", "y", "z")]

##      f       x       y       z
##  1:  4  4.8041  3.9788  1.2519
##  2:  2  1.1255 -0.8147  2.9053
##  3:  3  0.9699 -0.1550 -8.5876
##  4:  9  2.2685 -1.2734  1.0506
##  5:  5 -0.1282 -2.5512  5.0668
##  6: 10 -2.7397  0.5290 -0.3638
##  7:  1  2.9544 -3.1139 -1.3884
##  8:  8 -4.3488  0.6894  1.4195
##  9:  7  2.3152  0.6474  2.7183
## 10:  6 -0.1569  1.0142  0.9156

The sums of the x and z columns by f

DT[, lapply(.SD, sum), by = f, .SDcols = c("x", "z")]

##      f       x       z
##  1:  4  4.8041  1.2519
##  2:  2  1.1255  2.9053
##  3:  3  0.9699 -8.5876
##  4:  9  2.2685  1.0506
##  5:  5 -0.1282  5.0668
##  6: 10 -2.7397 -0.3638
##  7:  1  2.9544 -1.3884
##  8:  8 -4.3488  1.4195
##  9:  7  2.3152  2.7183
## 10:  6 -0.1569  0.9156

Examples calculating the mean

DT[, lapply(.SD, mean), by = f, .SDcols = c("x", "y", "z")]

##      f        x        y        z
##  1:  4  0.36955  0.30606  0.09630
##  2:  2  0.10232 -0.07407  0.26412
##  3:  3  0.07461 -0.01193 -0.66059
##  4:  9  0.15123 -0.08489  0.07004
##  5:  5 -0.01425 -0.28346  0.56298
##  6: 10 -0.21075  0.04069 -0.02799
##  7:  1  0.29544 -0.31139 -0.13884
##  8:  8 -0.54360  0.08617  0.17744
##  9:  7  0.38586  0.10790  0.45305
## 10:  6 -0.07844  0.50710  0.45782

DT[, lapply(.SD, mean), by = f, .SDcols = c("x", "z")]

##      f        x        z
##  1:  4  0.36955  0.09630
##  2:  2  0.10232  0.26412
##  3:  3  0.07461 -0.66059
##  4:  9  0.15123  0.07004
##  5:  5 -0.01425  0.56298
##  6: 10 -0.21075 -0.02799
##  7:  1  0.29544 -0.13884
##  8:  8 -0.54360  0.17744
##  9:  7  0.38586  0.45305
## 10:  6 -0.07844  0.45782
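
To tie this back to the question, the calls above could be wrapped in a function that takes the column names as arguments. This is only a sketch: `foo_dt` and its parameter names are hypothetical, and it assumes the grouping column is passed as a character string that is not itself the name of a column holding group labels.

```r
library(data.table)

# Example data in the same shape as the question (seed added for reproducibility)
set.seed(1)
df <- data.frame(x = rnorm(100), y = rnorm(100), z = rnorm(100),
                 f = factor(sample(1:10, 100, replace = TRUE)))
DT <- as.data.table(df)

# Hypothetical wrapper: cols and by_col are character vectors of column names,
# FUN is the summary function applied to each selected column within each group
foo_dt <- function(dt, cols, by_col = "f", FUN = sum) {
  dt[, lapply(.SD, FUN), by = by_col, .SDcols = cols]
}

res_sum  <- foo_dt(DT, "x")                      # sum of x by f
res_mean <- foo_dt(DT, c("x", "z"), FUN = mean)  # means of x and z by f
res_sum
```

As in the output shown earlier, groups appear in order of first occurrence rather than sorted by factor level.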

James Elderfield · Answer 3 · 2012-08-23T09:00:48.013

0

I haven't got enough rep to comment so will have to ask in answer form - why do you want to avoid using loops in R?

EDIT: Anyway using plyr I'd use count()
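
A rough sketch of how `count()` could return sums rather than occurrence counts, using the `wt_var` argument raised in the comments on this answer. It assumes plyr is installed; the seed is added for reproducibility only.

```r
library(plyr)

# Example data in the same shape as the question
set.seed(1)
df <- data.frame(x = rnorm(100), y = rnorm(100), z = rnorm(100),
                 f = factor(sample(1:10, 100, replace = TRUE)))

# Without wt_var, count() returns the number of rows per level of f.
n_per_f <- count(df, "f")

# With wt_var, each row is weighted by the named column, so the "freq"
# column holds the sum of x per level of f instead of a row count.
sum_x_per_f <- count(df, "f", wt_var = "x")
sum_x_per_f
```

Note that the result column is still called `freq`, so it may be worth renaming it if the output is passed on.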

edited Aug 23 '12 at 09:00

answered Aug 23 '12 at 08:53


also with regard to count...if you mean `count(df,"f")` (using my example) this merely counts the number of occurrences of each factor – h.l.m Aug 23 '12 at 09:17
Surely whatever function that does this will use a loop even if you're not explicitly making one yourself? Also I think I must have misunderstood as I thought you wanted the number of occurrences of each factor – James Elderfield Aug 23 '12 at 09:17
If not the number of occurrences perhaps you want to use the wt_var argument of count – James Elderfield Aug 23 '12 at 09:24
I think the `*ply` functions actually use a loop, but the loop is written in C, which makes it faster. However, `plyr` functions are not the most efficient. There are alternatives (such as `data.table`) that are faster in most situations. The main advantage of package `plyr` is its nice syntax. – Roland Aug 23 '12 at 11:14
Loops are typically not much slower than alternative methods such as plyr or base R's apply functions. The real reason loops tend to be looked down upon in R is that code in a loop is evaluated in its parent environment making it easy to unintentionally overwrite other existing variables or create variables which break later code. By contrast, plyr and apply functions are evaluated in a new environment, making such side effects impossible unless explicitly intended. – Michael Aug 31 '12 at 19:34
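
A small illustration of the scoping point made in the comment above, using base R only. The variable names are made up for the example.

```r
# A for loop runs in the calling environment, so it overwrites i
# and leaves tmp behind after it finishes.
i <- "important value"
for (i in 1:3) {
  tmp <- i^2
}
i_after_loop <- i                                   # now 3, the original value is lost
tmp_after_loop <- exists("tmp", inherits = FALSE)   # TRUE

# Reset, then do the same work with sapply and an anonymous function.
i <- "important value"
rm(tmp)
squares <- sapply(1:3, function(i) {
  tmp <- i^2   # local to the function call
  tmp
})
i_after_sapply <- i                                 # still "important value"
tmp_after_sapply <- exists("tmp", inherits = FALSE) # FALSE
```

The function body can still modify outer variables deliberately (for example with `<<-`), which is the "unless explicitly intended" case.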
