I am learning R and don't understand part of the function call below. What exactly is count = length(address) doing? Is there another way to do this?
crime_dat = ddply(crime, .(lat, lon), summarise, count = length(address))
The plyr library has two very common "helper" functions, summarise and mutate.
Summarise is used when you want to discard irrelevant data/columns, keeping only the levels of the grouping variable(s) and the specific summary values you compute for those groups (in your example, the result of length).
Mutate is used to add a column (analogous to transform in base R), but without discarding anything. If you run these two commands, they should illustrate the difference nicely.
library(plyr)
ddply(mtcars, .(cyl), summarise, count = length(mpg))
ddply(mtcars, .(cyl), mutate, count = length(mpg))
In this example, as in yours, the goal is to figure out how many rows there are in each group. In your case, count = length(address) gives the number of rows of crime for each unique (lat, lon) combination. When using ddply like this with summarise, we need to pick a function that takes a single column (vector) as an argument, so length is a good choice. Since we're just counting rows, i.e. taking the length of a vector, it doesn't matter which column we pass to it. Alternatively, we could use nrow, but nrow needs a whole data.frame, so it won't work inside summarise. Passing it to ddply directly saves us some typing:
ddply(mtcars, .(cyl), nrow)
But if we want to do more, summarise really shines:
ddply(mtcars, .(cyl), summarise, count = length(mpg),
      mean_mpg = mean(mpg), mean_disp = mean(disp))
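As a quick check of the claim that the column passed to length doesn't matter, here is a minimal sketch using mtcars. The object names (by_mpg, by_hp, by_nrow) are just illustrative, and it assumes plyr is installed:

```r
library(plyr)

# Count rows per cyl group using two different columns
by_mpg <- ddply(mtcars, .(cyl), summarise, count = length(mpg))
by_hp  <- ddply(mtcars, .(cyl), summarise, count = length(hp))

# Count rows per group by handing each sub-data.frame to nrow
by_nrow <- ddply(mtcars, .(cyl), nrow)

# The counts agree whichever column is used
all(by_mpg$count == by_hp$count)

# ... and they match the nrow version (second column holds the counts)
all(by_mpg$count == by_nrow[[2]])
```

Note that length counts NA values too, so this holds even if the chosen column has missing values.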
Is there another way to do this?
Yes, many other ways.
I'd second Alex's recommendation to use dplyr for things like this. The summarise and mutate concepts are still used, but it works faster and results in more readable code.
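For comparison, a rough dplyr equivalent of the ddply call above might look like this. It is a sketch, assuming dplyr is installed; the name counts is just illustrative:

```r
library(dplyr)

# Group by cyl, then collapse each group to one row of summaries.
# n() counts the rows in the current group, so no column needs to be named.
counts <- mtcars %>%
  group_by(cyl) %>%
  summarise(
    count     = n(),
    mean_mpg  = mean(mpg),
    mean_disp = mean(disp)
  )
counts

# Shortcut when only the per-group row count is needed
# (the count column is called n)
dplyr::count(mtcars, cyl)
```

If you load both packages in the same session, load plyr before dplyr, since both define summarise, mutate and count and the one attached last wins.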
Other options include the data.table package (also a great option), tapply() or aggregate() in base R, and countless other possibilities.
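To make those options concrete, here is a minimal sketch of the same per-group count with tapply(), aggregate(), and data.table. The object names are illustrative, and the data.table part is guarded so it only runs if that package is installed:

```r
# tapply: returns a named vector, one count per value of cyl
tap <- tapply(mtcars$mpg, mtcars$cyl, length)
tap

# aggregate: returns a data.frame; the count lands in the mpg column,
# so rename it to something clearer
agg <- aggregate(mpg ~ cyl, data = mtcars, FUN = length)
names(agg)[2] <- "count"
agg

# data.table: .N is the number of rows in each group
if (requireNamespace("data.table", quietly = TRUE)) {
  library(data.table)
  dt <- as.data.table(mtcars)
  print(dt[, .(count = .N), by = cyl])
}
```

As with the ddply version, the column handed to length in the tapply() and aggregate() calls is arbitrary, since only the number of rows per group matters.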