Understanding plyr's ddply function

Question

I am learning R and don't understand part of the call below. What exactly is count = length(address) doing? Is there another way to do this?

crime_dat = ddply(crime, .(lat, lon), summarise, count = length(address))

I am not familiar with `ddply` but I would guess `summarise` aggregates a number for unique groups of `lat` and `lon`. This number is called count and it is calculated by `length(address)`, which should be the number of addresses within each lat/lon group. — Alex, Sep 10 '14 at 00:54
consider using the package `dplyr` instead... if I am right in my above interpretation the equivalent `dplyr` command would be: `crime_dat <- crime %>% group_by(lat, lon) %>% summarise( count = length(address) )` — Alex, Sep 10 '14 at 00:56
I believe it'd be `crime %>% group_by(lat, lon) %>% tally()` or `crime %>% group_by(lat, lon) %>% summarise(count=n())` if one wants to do more summary stats. That `crime` data is in the `ggmap` pkg (if folks want to test) and that exact `crime_dat =` line is from: http://stackoverflow.com/questions/24273444/creating-leaflet-heatmaps-in-r-and-shiny-using-rcharts — hrbrmstr, Sep 10 '14 at 01:28
`with(crime, tapply(address, list(lat, lon), length))`. The length() of a vector is just the number of 'address'-items in each cross-category of `lat` and `lon`. (I'm assuming you know what they are.) The length() of a dataframe, on the other hand, is the number of columns. — IRTFM, Sep 10 '14 at 02:18

score 8 · Accepted Answer · answered Sep 10 '14 at 01:20

The plyr library has two very common "helper" functions, summarise (also spelled summarize) and mutate.

Summarise is used when you want to discard irrelevant data/columns, keeping only the levels of the grouping variable(s) and the specific summary values you compute for those groups (in your example, length).

Mutate is used to add a column (analogous to transform in base R), but without discarding anything. If you run these two commands, they should illustrate the difference nicely.

library(plyr)
ddply(mtcars, .(cyl), summarise, count = length(mpg))
ddply(mtcars, .(cyl), mutate, count = length(mpg))

In this example, as in your example, the goal is to figure out how many rows there are in each group. When using ddply like this with summarise, we need to pick a function that takes a single column (vector) as an argument, so length is a good choice. Since we're just counting rows / taking the length of the vector, it doesn't matter which column we pass to it. Alternatively, we could use nrow, but for that we have to pass a whole data.frame, so summarise won't work. In this case it saves us typing:

ddply(mtcars, .(cyl), nrow)
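
To check the claim that the column passed to length doesn't matter, here is a small sketch on the built-in mtcars data. The names by_mpg, by_hp and by_nrow are just illustrative, and it assumes plyr is installed.

```r
library(plyr)

# Count rows per cyl group via length() of two different columns.
by_mpg <- ddply(mtcars, .(cyl), summarise, count = length(mpg))
by_hp  <- ddply(mtcars, .(cyl), summarise, count = length(hp))

# Count rows per cyl group by passing each group's data frame to nrow().
by_nrow <- ddply(mtcars, .(cyl), nrow)

# All three approaches should give the same counts per group.
# The nrow() result is read by position, since its column is not named "count".
all(by_mpg$count == by_hp$count)
all(by_mpg$count == by_nrow[[2]])
```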

But if we want to do more, summarise really shines:

ddply(mtcars, .(cyl), summarise, count = length(mpg),
      mean_mpg = mean(mpg), mean_disp = mean(disp))

Is there another way to do this?

Yes, many other ways.

I'd second Alex's recommendation to use dplyr for things like this. The summarize and mutate concepts are still used, but it works faster and results in more readable code.
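
As a rough sketch of what that looks like on the mtcars example (assuming dplyr is installed; by_cyl is just an illustrative name), the multi-statistic summary could be written as:

```r
library(dplyr)

# Group by cyl, then compute the row count and two means per group.
# n() counts rows in the current group, so no column needs to be named.
by_cyl <- mtcars %>%
  group_by(cyl) %>%
  summarise(
    count     = n(),
    mean_mpg  = mean(mpg),
    mean_disp = mean(disp)
  )

print(by_cyl)
```

If you load both packages in the same session, load plyr before dplyr, since they share function names such as summarise and mutate.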

Other options include the data.table package (also a great option), tapply() or aggregate() in base R, and countless other possibilities.
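
For the base R route, a minimal sketch of the same per-group count on mtcars, with no extra packages (the variable names are illustrative):

```r
# tapply() applies length() to mpg within each cyl group
# and returns a named vector of counts.
counts_tapply <- tapply(mtcars$mpg, mtcars$cyl, length)

# aggregate() returns a data frame, which is closer in shape
# to the ddply(..., summarise, ...) result.
counts_agg <- aggregate(mpg ~ cyl, data = mtcars, FUN = length)
names(counts_agg)[2] <- "count"  # the column holds counts, not mpg values

print(counts_tapply)
print(counts_agg)
```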
