Running variance when time window is not constant

Question

I am trying to calculate a moving variance with a window of, let's say 4 years, for each one of the names A, B and C. The data is weekly:

> head(data1, 17)
         date name       value
1  1985-01-01    A -0.44008233
2  1985-01-01    B          NA #Observe that there are some NA's
3  1985-01-01    C  0.38682496
4  1985-01-08    A  0.41806540
5  1985-01-08    B -0.05460831
6  1985-01-08    C -0.52051435
7  1985-01-15    A  1.25769395
8  1985-01-15    B  0.80272053
9  1985-01-15    C -0.34501742
10 1985-01-22    A -0.43401839
11 1985-01-22    B  0.91113966
12 1985-01-22    C  1.07131717
13 1985-01-29    A -1.55395857
14 1985-01-29    B -0.43281709
15 1985-01-29    C  0.98034779
16 1985-02-05    A  1.70557396
17 1985-02-05    B  0.44688788

My approach so far is to dcast the data and then run rollapply() (zoo) column wise with a moving window of 192 = 4*12*4:

v <- dcast(data1, date ~ name, value.var = "value")
var <- rollapply(v[-1], width=4*12*4, var, fill=NA, by.column = T)
var <- cbind(v$date, var)
var[,1] <- as.Date(var[,1])

However, I realized that for some months I have four observations (eg. 7, 14, 21, 28 February) and for some I have five weekly observations (eg.1, 8, 15, 22 and 29 of January), so using a window of 4 years * 12 months * 4 weeks observations is not correct. I was thinking to add these extra observations on the time window (width), but I am not sure how (or if it is even possible), since these change depending on how many 5-weeks-per-month and how many 4-weeks-per-month observations there are inside the time window.

Additionally, I would like to have a NA when there are NA observations within the moving time window (this is handled automatically by var() anyway I think) and also I would like to ignore zero observations. For that I was thinking I could remove the zeroes before running the running variance function and then somehow put them back in in the end. So you can ignore that part, unless, of course, you have some good idea to do it in one step.

Sample data:

set.seed(486)
date <- rep(seq(as.Date("1985-01-01"), as.Date("2010-01-1"), by="weeks"), each=3)
N <- length(date)
name <- c("A","B","C")
value <- rnorm(N)
i<-which(value %in% sample(value, 25)) ;i
j<-which(value %in% sample(value, 150)) ;j
value[i] <- NA
value[j] <- 0
data1 <- data.frame(date, name, value)

score 2 · Answer 1 · edited Jun 20 '20 at 09:12

2

I don't think that you should your rolling window should coincide of the number of weeks in your data. It is just a window over data to smooth it. One idea is to do the something with 2 windows and take the average value:

library(data.table)
library(zoo)
setDT(data1)[,var := {
           v1 <- rollapplyr(value,width=4*12*4, var, fill=N)
           v2 <- rollapplyr(value,width=4*12*5, var, fill=N)
           (v1+v2)/2},  name]

PS: Here I am using data.table because it is suitable for split(per group) and rebind operations.

Edit

You can also transform you weekly data to a daily one then you can calculate more accurate a roll in this basis. The idea is to create a daily index and merge it with your original data. This will create a new data.table with missing values. You replace the missing values with first non missing values using na.locf.

library(data.table)
library(zoo)
ID <- 
data.table(
  date = seq(as.Date("1985-01-01"), as.Date("2010-01-1"), by="days"))
setkey(ID,date)

setDT(data1)[,date:=as.Date(date)][, 
        {
          merge(ID,.SD,all.x=TRUE)[,value := na.locf(value)]
        },
        
        name]

edited Jun 20 '20 at 09:12

Community

1
1

answered Jun 18 '15 at 12:24

agstudy

119,832
17
199
261

That's interesting. We could take it one step further maybe I think, by using a weighted mean, instead of a simple average, after counting how many 5-per-month and 4-per-month weekly observations there exist. I am more interested in calculating the exact solution though. I start to suspect that using the indexes of the observations directly is not the best way to go. We should be able to use the dates somehow. Still not sure. – Per Jun 19 '15 at 10:43
@Per very easy to calculate the weights. I can show you this , but if you want an exact solution I think you should approximate (interpolate) a daily time series from your weekly one. – agstudy Jun 19 '15 at 10:48
Yes, I think I know how I could calculate the weights too, thanks^^ As for making a daily series, I am a bit worried that then I would have too much data to handle it efficiently in R, since my data-set is already big. Excluding weekends, it would grow 5x, right? – Per Jun 19 '15 at 10:53
@Per I add a code to transform weekly.daily series. – agstudy Jun 19 '15 at 11:31

G. Grothendieck · Answer 2 · 2015-06-18T13:04:22.850

2

4 years has 208 weeks plus 5 days so it is not evenly divisible into weeks. If we use 209 weeks then we are only out by 2 days in 4 years so let us try that.

First convert data1 to "zoo" class splitting the data into separate columns according to the value of the 2nd column. z will have one column for each of A, B and C. Then define a variance function which excludes zeros and use it with rollapplyr

library(zoo)
z <- read.zoo(data1, split = 2) # 1305 x 3 
var0 <- function(x) var(x[x != 0])
r <- rollapplyr(z, 209, var0)

Leaving it as a zoo object may be sufficient but this would convert it to a 4 column data frame with columns Index, A, B and C:

fortify.zoo(r)

edited Jun 18 '15 at 13:04

answered Jun 18 '15 at 12:52

G. Grothendieck

254,981
17
203
341

Okay, this is a good approximation. At the moment I am trying to think if we can use endpoints() on the dates to obtain an exact solution. – Per Jun 19 '15 at 10:46
An exact solution does not exist because, as explained, 4 years ia not a multiple of weeks. – G. Grothendieck Jun 19 '15 at 12:04
Yes, I understand that, that's why I am trying to think if we could change approach and instead of using a number of observations for the moving window, use specific indexes from `endpoints()` (or equivalently their distance). That would mean a different pair of points to define the moving window for each variance to be calculated and makes it a bit complicated – Per Jun 19 '15 at 12:17
That doesn't help at all. 4 years is still not a multiple of 7 days so either the last week will be partly out and partly in the 4 year span (which is what using 209 weeks does) or else there will be one week with no days in the 4 week span which is what 208 weeks does. Whether you specify the weeks by endpoints or not is immaterial. – G. Grothendieck Jun 19 '15 at 12:55

Running variance when time window is not constant

2 Answers2

Edit