
I've been trying to understand what plyr does and how it works by trying different variables and functions and seeing what results. So I'm looking more for an explanation of how plyr works than for specific fix-it answers. I've read the documentation, but my newbie brain is still not getting it.

Some data and names:

library(plyr)

mydf <- data.frame(c("a","a","b","b","c","c"),
                   c("e","e","e","e","e","e"),
                   c(1,2,3,10,20,30),
                   c(5,10,20,20,15,10))
colnames(mydf) <- c("Model", "Class", "Length", "Speed")
mydf

Question 1: Summarise versus Transform Syntax

So if I enter: ddply(mydf, .(Model), summarise, sum = Length+Length)

I get:

  Model ..1
1     a   2
2     a   4
3     b   6
4     b  20
5     c  40
6     c  60

and if I enter: ddply(mydf, .(Model), summarise, Length+Length) I get the same result.

Now if I use transform: ddply(mydf, .(Model), transform, sum = (Length+Length))

I get:

  Model Class Length Speed sum
1     a     e      1     5   2
2     a     e      2    10   4
3     b     e      3    20   6
4     b     e     10    20  20
5     c     e     20    15  40
6     c     e     30    10  60

But if I write it like the second summarise call, without a name: ddply(mydf, .(Model), transform, (Length+Length))

  Model Class Length Speed
1     a     e      1     5
2     a     e      2    10
3     b     e      3    20
4     b     e     10    20
5     c     e     20    15
6     c     e     30    10

So why does adding "sum =" make a difference?

Question 2: Why don't these work?

ddply(mydf, .(Model), sum, Length+Length) # Error in function (i) : object 'Length' not found

ddply(mydf, .(Model), length, mydf$Length) # Error in .fun(piece, ...) : 2 arguments passed to 'length' which requires 1

These examples are more to show that somewhere I'm fundamentally not understanding how to use plyr.

Any answers or explanations are appreciated.

rsgmon

3 Answers


I find that when I'm having trouble "visualizing" how any of the functional tools in R work, the easiest thing to do is to call browser() on a single piece:

ddply(mydf, .(Model), function(x) browser() )

Then inspect x in real-time and it should all make sense. You can then test out your function on x, and if it works you're golden (barring other groupings being different than your first x).
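If you'd rather not drop into the interactive browser, a rough non-interactive equivalent is to build the pieces yourself and look at one. This is only a sketch: it uses base R's split() to approximate the per-group data frames that ddply passes to .fun, and the names pieces and x are just for illustration.

```r
mydf <- data.frame(
  Model  = c("a", "a", "b", "b", "c", "c"),
  Class  = rep("e", 6),
  Length = c(1, 2, 3, 10, 20, 30),
  Speed  = c(5, 10, 20, 20, 15, 10)
)

# One data frame per level of Model, roughly what .fun receives from ddply
pieces <- split(mydf, mydf$Model)

# Look at the first piece, as you would inspect x inside browser()
x <- pieces[["a"]]
print(x)

# Try your function on the piece before handing it to ddply
print(sum(x$Length))
```

Once the function behaves on one piece, the same function can be passed to ddply as .fun.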

Ari B. Friedman

The syntax is:

ddply(data.frame, variable(s), function, optional arguments)

where the function is expected to return a data.frame. In your situation,

  • summarise is a function that will transparently create a new data.frame, with the results of the expression that you provide as further arguments (...)

  • transform, a base R function, will transform the data.frames (first split by the variable(s)), adding new columns according to the expression(s) that you provide as further arguments. These need to be named, that's just the way transform works.

If you use other functions than subset, transform, mutate, with, within, or summarise, you'll need to make sure they return a data.frame (length and sum don't), or at the very least a vector of appropriate length for the output.
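As a minimal sketch of that last point, here is a hand-written function that takes one piece and returns a data.frame. The name total_length and the columns total and n are made up for the example.

```r
library(plyr)

mydf <- data.frame(
  Model  = c("a", "a", "b", "b", "c", "c"),
  Class  = rep("e", 6),
  Length = c(1, 2, 3, 10, 20, 30),
  Speed  = c(5, 10, 20, 20, 15, 10)
)

# Hypothetical helper: receives one piece (a data.frame) and returns a one-row data.frame
total_length <- function(piece) {
  data.frame(total = sum(piece$Length), n = nrow(piece))
}

# ddply calls total_length once per Model and binds the results together
res <- ddply(mydf, .(Model), total_length)
print(res)
```

Note that the columns are reached through the piece (piece$Length), which is why a bare Length passed to sum in the question could not be found.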

baptiste
    Also, I believe the first set of examples in the OP is simply the difference in default behavior between `summarise` and `transform` if you neglect to include a tag like `val = ` in the expression. `summarise` will apparently supply its own name, whereas `transform` appears to ignore it. – joran Jul 06 '12 at 22:25

The way I understand it, the ddply(... , .(...) , summarise, ...) operations are designed to reduce the number of rows to match the number of distinct combinations of the .(...) grouping variables. So for your first example, this seemed natural:

ddply(mydf, .(Model), summarise, sL = sum(Length))
  Model sL
1     a  3
2     b 13
3     c 50

OK. Seems to work for me (not a regular plyr user). The transform operations on the other hand I understand to be making new columns of the same length as the dataframe. That was what your first transform call accomplished. Your second one (a failure) was:

ddply(mydf, .(Model), transform, (Length+Length))

That one did not supply a name for the result of the operation, so nothing new was assigned in the output. When you added sum=(Length+Length), a name suddenly became available (and the sum function was not used; sum there is only a column name). It's generally a bad idea to use function names as column names.
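To put the three cases side by side, here is a small sketch. The column names total and doubled and the result names are mine, chosen so that nothing shadows a function name.

```r
library(plyr)

mydf <- data.frame(
  Model  = c("a", "a", "b", "b", "c", "c"),
  Class  = rep("e", 6),
  Length = c(1, 2, 3, 10, 20, 30),
  Speed  = c(5, 10, 20, 20, 15, 10)
)

# summarise with an aggregating expression: one row per Model
s <- ddply(mydf, .(Model), summarise, total = sum(Length))

# transform with a named expression: all rows kept, one column added
t_named <- ddply(mydf, .(Model), transform, doubled = Length + Length)

# transform with an unnamed expression: the result has nowhere to go
t_unnamed <- ddply(mydf, .(Model), transform, Length + Length)

print(s)
print(names(t_named))
print(names(t_unnamed))
```

The summarise call in the question returned six rows only because Length+Length is not an aggregate; it yields one value per input row.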

On question two, I think that the .fun argument needs to be a plyr function or something that makes sense applied to a (split) dataframe as a whole, rather than any old function. There is no sum.data.frame function. But 'nrow' or 'ncol' do make sense. You can even get 'str' to work in that position. The length function applied to a dataframe gives the number of columns:

 ddply(mydf, .(Model), length )  # all 4's
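For comparison, a short sketch of nrow next to length in the .fun position. The result names are illustrative, and I pull out the value column by position rather than relying on whatever name plyr gives it.

```r
library(plyr)

mydf <- data.frame(
  Model  = c("a", "a", "b", "b", "c", "c"),
  Class  = rep("e", 6),
  Length = c(1, 2, 3, 10, 20, 30),
  Speed  = c(5, 10, 20, 20, 15, 10)
)

# nrow sees each piece as a whole data frame: number of rows per Model
rows_per_model <- ddply(mydf, .(Model), nrow)

# length of a data frame is its number of columns, so this is the same for every piece
cols_per_model <- ddply(mydf, .(Model), length)

print(rows_per_model[[2]])
print(cols_per_model[[2]])
```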
IRTFM