I need to create a moving average of a variable that only considers previous observations of this variable, computed separately for each data group.
Until now I have been using a function and then hacking my variables a bit to make it work. Let me explain below.
I got this function from Stack Overflow:
# right-aligned moving average of the n most recent values (the current one included); a single NA if the group has fewer than n values
mav <- function(x, n) if (length(x) >= n) stats::filter(x, rep(1/n, n), sides = 1) else NA_real_
Let's take the example of a moving average over 2 observations:
library(data.table)
test <- data.table(values = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12), category = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2))
test[, ma2 := as.numeric(mav(values, n = 2)), by = category]
This yields:
values category ma2
1 1 NA
2 1 1.5
3 1 2.5
4 1 3.5
5 1 4.5
6 1 5.5
7 2 NA
8 2 7.5
9 2 8.5
10 2 9.5
11 2 10.5
12 2 11.5
I want the 3rd observation of ma2 to be the mean of the two previous observations of values (observations 1 and 2). But here, the 3rd observation of ma2 is the mean of the 2nd and 3rd observations of values.
So I create "vprev", another variable that is the same as "values" but takes the previous value of "values" for each observation:
test[, vprev:= as.numeric(shift(values, 1L, type = "lag" )), by = category]
And then I run the moving average ("TRUEma2") on the vprev variable instead:
test[, TRUEma2 := as.numeric(mav(vprev, n = 2)), by = category]
values category ma2 vprev TRUEma2
1 1 NA NA NA
2 1 1.5 1 NA
3 1 2.5 2 1.5
4 1 3.5 3 2.5
5 1 4.5 4 3.5
6 1 5.5 5 4.5
7 2 NA NA NA
8 2 7.5 7 NA
9 2 8.5 8 7.5
10 2 9.5 9 8.5
11 2 10.5 10 9.5
12 2 11.5 11 10.5
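(Side note, just for completeness: the same TRUEma2 column can also be obtained by lagging the result of mav instead of its input, as in the sketch below; the column name TRUEma2_bis is only for illustration, and since it still needs one shift call per group I don't expect it to change the timings much.)
test[, TRUEma2_bis := shift(as.numeric(mav(values, n = 2)), 1L, type = "lag"), by = category]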
That used to work just fine because my data sets were pretty small. But now I have to do this on multiple data sets of about 2 to 3 million observations each, and I have to create moving averages for about 30 variables in each data set. The process I described takes up to 1 min 40 s per variable, so I calculated that I would need about 25 hours to preprocess all my data sets...
I saw that what takes the most time is the part where I create the new variable holding the previous observation of another variable (approximately 1 minute):
test[, vprev:= as.numeric(shift(values, 1L, type = "lag" )), by = category]
The moving average itself does not take a lot of time to compute.
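For reference, the timings above come from the real data; a rough sketch of how they can be reproduced on any one data set (system.time is base R, nothing else is assumed, the comments refer to my real data rather than this toy table):
system.time(test[, vprev := as.numeric(shift(values, 1L, type = "lag")), by = category])   # ~1 minute on the real data
system.time(test[, TRUEma2 := as.numeric(mav(vprev, n = 2)), by = category])               # much faster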
I tried skipping this step by putting the shift() directly inside the moving average call:
test[, TRUEma2 := as.numeric(mav(shift(values, 1L, type = "lag"), n = 2)), by = category]
But it's not faster.
I also tried modifying the moving average function this way:
mav2 <- function(x,n) if(length(x) >= n+1)stats::filter(x-1,rep(1/n,n), sides=1) else NA_real_
But then the first value of x can take the value of the observation before it, which is not in the same data group/category:
values category mav2
1 1 NA
2 1 0.5
3 1 1.5
4 1 2.5
5 1 3.5
6 1 4.5
7 2 NA
8 2 6.5
9 2 7.5
10 2 8.5
11 2 9.5
12 2 10.5
So here is my question: would it be possible to have a moving average function that is as fast as the first one described above, but that only calculates the mean over previous observations?
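For what it's worth, here is a sketch of the kind of single grouped call I am hoping for. It assumes data.table's frollmean (available in recent versions, 1.12.0+) is an acceptable replacement for the stats::filter based function, and the column name TRUEma2_froll is only for illustration; I have not benchmarked it on the full data:
# lag first, then take a right-aligned rolling mean of width 2;
# on the toy table this reproduces the TRUEma2 column above
test[, TRUEma2_froll := frollmean(shift(values, 1L, type = "lag"), n = 2), by = category]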
Thanks a lot for your help :)
EDIT: I tried the solutions proposed by lbusett and Icecreamtoucan, and although they worked on test data, I got the following error message on the real data:
Error in `[.data.table`(toptrain2, , `:=`(paste0("m3_", c("killsM")),  :
  Type of RHS ('double') must match LHS ('logical'). To check and coerce would
  impact performance too much for the fastest cases. Either change the type of
  the target column, or coerce the RHS of := yourself (e.g. by using 1L instead of 1)
I was asked to give a sample of the actual data. Here is a dput (just a little extract of my data):
structure(list(killsM = c(4L, 2L, 0L, 3L, 6L, 0L, 1L, 2L, 3L,
5L, 6L, 1L, 4L, 4L, 2L, 6L, 6L, 3L, 1L, 2L), soloKillsM = c(4L,
2L, 0L, 0L, 3L, 0L, 0L, 1L, 1L, 3L, 0L, 0L, 1L, 2L, 0L, 3L, 0L,
1L, 0L, 0L), deathsM = c(3L, 5L, 5L, 1L, 4L, 4L, 3L, 2L, 0L,
4L, 1L, 7L, 1L, 2L, 2L, 2L, 3L, 3L, 3L, 1L), assistsM = c(1L,
1L, 2L, 2L, 7L, 0L, 2L, 2L, 3L, 0L, 4L, 1L, 0L, 1L, 1L, 1L, 4L,
1L, 3L, 3L), killParticipationM = c(0.151515151515152, 0.0909090909090909,
0.125, 0.3125, 0.464285714285714, 0, 0.157894736842105, 0.210526315789474,
0.222222222222222, 0.185185185185185, 0.434782608695652, 0.0869565217391304,
0.2, 0.25, 0.130434782608696, 0.304347826086957, 0.4, 0.16, 0.181818181818182,
0.227272727272727), firstTowerKillM = c(0L, 0L, 0L, 0L, 1L, 0L,
0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L)), row.names = c(NA,
20L), class = "data.frame")
To me, it seems that the only difference from the test data is the names of the variables and the values of the observations.
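From the error text, my guess is that the target column (m3_killsM in the message) already exists in toptrain2 with type logical, for instance because it was initialised with a plain NA, so := refuses to overwrite it with doubles. A minimal sketch of the fixes the message itself suggests (the column name comes from the error, everything else is assumption):
# either retype the stale logical column to double before running the rolling mean ...
toptrain2[, m3_killsM := as.numeric(m3_killsM)]
# ... or simply drop it and let := recreate it with the right type
# toptrain2[, m3_killsM := NULL]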