1

I am trying to count the number of positive events over a 12 month rolling window.

I can create 365 rows of missing data per year and use zoo::rollapply to sum the number of events per 365 rows of data, but my data frame is really big and I want to do this on a bunch of variables, so this takes forever to run.

I can get the correct output with this:

data <- data.frame(id = c("a","a","a","a","a","b","b","b","b","b"),
                   date = c("20-01-2011","20-04-2011","20-10-2011","20-02-2012",
                            "20-05-2012","20-01-2013","20-04-2013","20-10-2013",
                            "20-02-2014","20-05-2014"),
                   event = c(0,1,1,1,0,1,0,0,1,1))
library(lubridate)
library(dplyr)
library(tidyr)
library(zoo)

data %>%
group_by(id) %>%
mutate(date = dmy(date),
       cumsum = cumsum(event)) %>%
complete(date = full_seq(date, period = 1), fill = list(event = 0)) %>%
mutate(event12 = rollapplyr(event, width = 365, FUN = sum, partial = TRUE)) %>%
drop_na(cumsum)

Which is this:

 id     date       event cumsum event12
 <fct>  <date>     <dbl>  <dbl>   <dbl>
 a      2011-01-20     0      0       0
 a      2011-04-20     1      1       1
 a      2011-10-20     1      2       2
 a      2012-02-20     1      3       3
 a      2012-05-20     0      3       2
 b      2013-01-20     1      1       1
 b      2013-04-20     0      1       1
 b      2013-10-20     0      1       1
 b      2014-02-20     1      2       1
 b      2014-05-20     1      3       2

But want to see if there's a more efficient way, as in how would I make the width in rollyapply count up dates rather than count up rows.

M--
  • 25,431
  • 8
  • 61
  • 93
lokes
  • 75
  • 8

1 Answers1

0

This can be done without filling out the missing dates using a complex self join and a single sql statement after converting the dates to Date class:

library(sqldf)

data2 <- transform(data, date = as.Date(date, "%d-%m-%Y"))

sqldf("select a.*, sum(b.event) as event12
  from data2 as a
  left join data2 as b on a.id = b.id and b.date between a.date - 365 and a.date
  group by a.rowid
  order by a.rowid")

giving:

   id       date event event12
1   a 2011-01-20     0       0
2   a 2011-04-20     1       1
3   a 2011-10-20     1       2
4   a 2012-02-20     1       3
5   a 2012-05-20     0       2
6   b 2013-01-20     1       1
7   b 2013-04-20     0       1
8   b 2013-10-20     0       1
9   b 2014-02-20     1       1
10  b 2014-05-20     1       2
G. Grothendieck
  • 254,981
  • 17
  • 203
  • 341