3

I'm trying to apply a function (say standard deviation) in a rolling window, by category:

I have the following data:

cat = c("A", "A", "A", "A", "B", "B", "B", "B") 
year = c(1990, 1991, 1992, 1993, 1990, 1991, 1992, 1993) 
value = c(2, 3, 5, 6, 8, 9, 4, 5) 
df = data.frame(cat, year, value)

I would like to create a new column (say sd) that estimates the standard deviation over two year window by cat.

Here's the result I'm thinking of:

enter image description here

Any advice on how to achieve this?

JasonMArcher
  • 14,195
  • 22
  • 56
  • 52
ec0n0micus
  • 1,075
  • 2
  • 12
  • 19

1 Answers1

2

It can be done by using rollapply from the zoo package:

library(zoo)

cat = c("A", "A", "A", "A", "B", "B", "B", "B") 
year = c(1990, 1991, 1992, 1993, 1990, 1991, 1992, 1993) 
value = c(2, 3, 5, 6, 8, 9, 4, 5) 
df = data.frame(cat, year, value)

df$stdev <- unlist(by(df, df$cat, function(x) {
  c(NA, rollapply(x$value, width=2, sd))
}), use.names=FALSE)

print(df)
##   cat year value     stdev
## 1   A 1990     2        NA
## 2   A 1991     3 0.7071068
## 3   A 1992     5 1.4142136
## 4   A 1993     6 0.7071068
## 5   B 1990     8        NA
## 6   B 1991     9 0.7071068
## 7   B 1992     4 3.5355339
## 8   B 1993     5 0.7071068

You can also do it with ddply if you'd rather use plyr functions than by:

df$stdev <- ddply(df, .(cat), summarise, 
                  stdev=c(NA, rollapply(value, width=2, sd)))$stdev

As a lark, I did a system.time (multiple times) comparison of the above two methods and also the ave method pointed out by @thelatemail in the comment thread below this answer (starting with a "fresh" copy of the data frame).

df <- data.frame(cat, year, value)
system.time(df$stdev <- with(df, ave(value, cat, FUN=function(x) c(NA, rollapply(x, width=2, sd)))))

df <- data.frame(cat, year, value)
system.time(df$stdev <- unlist(by(df, df$cat, function(x) c(NA, rollapply(x$value, width=2, sd))), use.names=FALSE))

df <- data.frame(cat, year, value)
system.time(df$stdev <- ddply(df, .(cat), summarise, stdev=c(NA, rollapply(value, width=2, sd)))$stdev)

Both the ave and by methods take:

   user  system elapsed 
  0.002   0.000   0.002 

and the ddply version takes:

   user  system elapsed 
  0.004   0.000   0.004 

Not that speed is really an issue here, but it looks like the ave and by versions are the most efficient ways to do this.

Community
  • 1
  • 1
hrbrmstr
  • 77,368
  • 11
  • 139
  • 205
  • thank you, but it's not quite right - i want those standard deviations to be grouped by "cat". can i use ddply or something similar here? – ec0n0micus Apr 01 '14 at 02:00
  • 1
    No need for `unlist(by(...))` - `ave` is probably more appropriate: `with(df, ave(value, cat, FUN=function(x) c(NA, rollapply(x, width=2, sd)) ))` – thelatemail Apr 01 '14 at 02:40
  • Nice. Didn't think of `ave`. It does `interaction(...); split(x, g) <- lapply(split(x, g), FUN)` under the covers so it does alot under the covers. – hrbrmstr Apr 01 '14 at 02:46
  • 2
    A slight simplification could be had by using `rollapplyr` with the `fill=NA` argument: `ave(df$value, df$cat, FUN = function(v) rollapplyr(v, 2, sd, fill = NA))` – G. Grothendieck Apr 01 '14 at 03:23
  • [@G](http://stackoverflow.com/users/516548/g-grothendieck) Now I want to go back and add that to the time tests! That's prbly the best all around solution. – hrbrmstr Apr 01 '14 at 03:33