
I need to create a moving average of a variable that only considers previous observations of that variable, computed separately for each data group.

I used to use a function, and then hack my variables a bit to make it work. Let me explain below.

I got this function from Stack Overflow:

mav <- function(x, n) if (length(x) >= n) stats::filter(x, rep(1/n, n), sides = 1) else NA_real_

Let's take the example of a moving average on 2 observations :

library(data.table)

test = data.table("values" = c(1,2,3,4,5,6,7,8,9,10,11,12), "category" = c(1,1,1,1,1,1,2,2,2,2,2,2))
test[, ma2 := as.numeric(mav(values, n = 2)), by = category]

This yields :

   values category  ma2
      1        1   NA
      2        1  1.5
      3        1  2.5
      4        1  3.5
      5        1  4.5
      6        1  5.5
      7        2   NA
      8        2  7.5
      9        2  8.5
     10        2  9.5
     11        2 10.5
     12        2 11.5

I want the 3rd observation of ma2 to be the mean of the first two observations of values. But here, the 3rd observation of ma2 is the mean of the 2nd and 3rd observations of values.

So I create "vprev", another variable that holds, for each observation, the previous value of "values":

test[, vprev:= as.numeric(shift(values, 1L, type = "lag" )), by = category]

And then, I run the moving average ("TRUEma2") on the vprev variable instead :

test[, TRUEma2 := as.numeric(mav(vprev, n = 2)), by = category] 

values category  ma2 vprev TRUEma2
  1        1   NA    NA      NA
  2        1  1.5     1      NA
  3        1  2.5     2     1.5
  4        1  3.5     3     2.5
  5        1  4.5     4     3.5
  6        1  5.5     5     4.5
  7        2   NA    NA      NA
  8        2  7.5     7      NA
  9        2  8.5     8     7.5
 10        2  9.5     9     8.5
 11        2 10.5    10     9.5
 12        2 11.5    11    10.5

That used to work just fine because my data sets were pretty small. But now I have to do this on multiple data sets of about 2 to 3 million observations each, and I have to create moving averages for about 30 variables in each data set. The process described above takes up to 1 min 40 s per variable, so I calculated that I would need 25 hours to preprocess all my data sets...

I found that what takes the most time (approximately 1 minute) is creating the new variable that holds the previous observation of another variable:

test[, vprev:= as.numeric(shift(values, 1L, type = "lag" )), by = category]

The moving average itself does not take a lot of time to compute.
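To confirm where the time goes, the two steps can be timed separately. A minimal sketch on synthetic data (the sizes and group counts below are illustrative, not the real data):

```r
library(data.table)

set.seed(1)
n <- 2e5
big <- data.table(values = rnorm(n),
                  category = sample(1e4, n, replace = TRUE))

mav <- function(x, n) if (length(x) >= n)
  stats::filter(x, rep(1/n, n), sides = 1) else NA_real_

# Time the grouped shift() and the grouped moving average separately.
t_shift <- system.time(
  big[, vprev := shift(values, 1L, type = "lag"), by = category])
t_mav <- system.time(
  big[, ma2 := as.numeric(mav(vprev, n = 2)), by = category])
```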

I tried skipping this by putting a shift() in the moving average code line :

test[, TRUEma2 := as.numeric(mav(shift(values, 1L, type = "lag"), n = 2)), by = category]

But it's not faster.

I also tried to modify the moving average function this way :

mav2 <- function(x,n) if(length(x) >= n+1)stats::filter(x-1,rep(1/n,n), sides=1) else NA_real_ 

But then the first value of x can take the value of the observation before it, which is not in the same data group/category.

     values category mav2
      1        1   NA
      2        1  0.5
      3        1  1.5
      4        1  2.5
      5        1  3.5
      6        1  4.5
      7        2   NA
      8        2  6.5
      9        2  7.5
     10        2  8.5
     11        2  9.5
     12        2 10.5

So here is my question: would it be possible to have a moving average function that is as fast as the first one described above, but that only calculates the mean over previous observations?

Thanks a lot for your help :)

EDIT: I tried the solutions proposed by lbusett and IceCreamToucan, and although they worked on the test data, I got the following error message on the real data:

Error in `[.data.table`(toptrain2, , `:=`(paste0("m3_", c("killsM")), : Type of RHS ('double') must match LHS ('logical'). To check and coerce would impact performance too much for the fastest cases. Either change the type of the target column, or coerce the RHS of := yourself (e.g. by using 1L instead of 1)

I was asked to give a sample of the actual data. Here is a dput (just a little extract of my data):

structure(list(killsM = c(4L, 2L, 0L, 3L, 6L, 0L, 1L, 2L, 3L, 5L, 6L, 1L, 4L, 4L, 2L, 6L, 6L, 3L, 1L, 2L), soloKillsM = c(4L, 2L, 0L, 0L, 3L, 0L, 0L, 1L, 1L, 3L, 0L, 0L, 1L, 2L, 0L, 3L, 0L, 1L, 0L, 0L), deathsM = c(3L, 5L, 5L, 1L, 4L, 4L, 3L, 2L, 0L, 4L, 1L, 7L, 1L, 2L, 2L, 2L, 3L, 3L, 3L, 1L), assistsM = c(1L, 1L, 2L, 2L, 7L, 0L, 2L, 2L, 3L, 0L, 4L, 1L, 0L, 1L, 1L, 1L, 4L, 1L, 3L, 3L), killParticipationM = c(0.151515151515152, 0.0909090909090909, 0.125, 0.3125, 0.464285714285714, 0, 0.157894736842105, 0.210526315789474, 0.222222222222222, 0.185185185185185, 0.434782608695652, 0.0869565217391304, 0.2, 0.25, 0.130434782608696, 0.304347826086957, 0.4, 0.16, 0.181818181818182, 0.227272727272727), firstTowerKillM = c(0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L)), row.names = c(NA, 20L), class = "data.frame")

To me, it seems that the only difference from the test data is the names of the variables and the values of the observations.

JBR
  • @IceCreamToucan thanks for the suggestion, unfortunately it gives me the following error message : Error in `[.data.table`(test, , `:=`(ma, shift(rollmeanr(values, : Type of RHS ('double') must match LHS ('logical'). To check and coerce would impact performance too much for the fastest cases. Either change the type of the target column, or coerce the RHS of := yourself (e.g. by using 1L instead of 1) – JBR Dec 29 '18 at 16:37
  • I admit I don't know how to do so :/ . How would you rephrase this : test[, ma := shift(rollmeanr(values, 4, na.pad = T)), category] By itself, "ma" is a new and different column from "values", no ? – JBR Dec 29 '18 at 16:53
  • I tried another name than "ma", and received the same error message. I'll update right away the main post with a dput – JBR Dec 29 '18 at 17:26
  • Which is your "category" variable on the real dataset? The solution I proposed seems to fail if you have less than 2 observations within any group of the category variable (probably because rollmean gives NULL on a 1-element array). – lbusett Dec 29 '18 at 17:58
  • The "category" variable is "sumchamp", but is absent from the dput. Sumchamp is just a 6 figure number, I use it like an ID. It's 100% true that some groups have less than 2 observations. To remove them, i would usually compute the moving average first, and then remove all observations with "NA" values. – JBR Dec 29 '18 at 18:01
  • then you'll have to remove them beforehand. You can use table(as.factor(test$sumchamp)) to identify the "bad" IDS and subset the input DT accordingly – lbusett Dec 29 '18 at 18:09
  • if you sort your input data, you can do `shift` by all observations and then just set to NA first observation in group, might be faster than shift by group. – jangorecki Dec 31 '18 at 07:15
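The last comment's idea can be sketched as follows, assuming the table is already sorted by category (the `.I` trick used to find each group's first row is one possible implementation, not the only one):

```r
library(data.table)

test <- data.table(values = as.numeric(1:12),
                   category = rep(1:2, each = 6))

# Shift once over the whole table (no by=), then blank the first row
# of each group so no value leaks across category boundaries.
test[, vprev := shift(values, 1L, type = "lag")]
test[test[, .I[1L], by = category]$V1, vprev := NA_real_]
```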

3 Answers


What about shifting the results instead of the input values? Something like this (using rollmean from package zoo):

library(data.table)
library(zoo)
test = data.table("values" = c(1,2,3,4,5,6,7,8, 9,10,11,12), 
                  "category" = c(1,1,1,1,1,1,2,2,2,2,2,2))
test[, paste0("ravg_", c("values")) := shift(lapply(
  .SD, rollmean, k = 2, na.pad = TRUE, align = "right"), 1), 
  .SDcols = c("values"), by = category]

    values category ravg_values
 1:      1        1          NA
 2:      2        1          NA
 3:      3        1         1.5
 4:      4        1         2.5
 5:      5        1         3.5
 6:      6        1         4.5
 7:      7        2          NA
 8:      8        2          NA
 9:      9        2         7.5
10:     10        2         8.5
11:     11        2         9.5
12:     12        2        10.5

You can also adapt it to multiple columns easily (see https://stackoverflow.com/a/31482551/6871135)

lbusett
  • I just tried, unfortunately it gave me this error message : Error in `[.data.table`(test, , `:=`(paste0("ravg_", c("values")), : Type of RHS ('double') must match LHS ('logical'). To check and coerce would impact performance too much for the fastest cases. Either change the type of the target column, or coerce the RHS of := yourself (e.g. by using 1L instead of 1) – JBR Dec 29 '18 at 16:28
  • I am not seeing the error on the test dataset. Could you share a subset of a "real" one? – lbusett Dec 29 '18 at 16:44
  • For convenience, I just added a dput in my main post. – JBR Dec 29 '18 at 17:28
  • starting from 1.12.0 (or current devel) you can use `frollmean` instead of `zoo::rollmean`, you should get significant speed up when using on multiple variables as it will compute rolling mean in parallel – jangorecki Dec 31 '18 at 07:12
  • `lapply` is then redundant. – jangorecki Dec 31 '18 at 07:19
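Following the comments above, on data.table >= 1.12.0 the zoo call can be replaced with `data.table::frollmean`, which accepts `.SD` directly (so the `lapply` is indeed redundant). A sketch, where the second column `other` and the `ravg_` prefix are illustrative:

```r
library(data.table)

test <- data.table(values = as.numeric(1:12),
                   other  = as.numeric(12:1),
                   category = rep(1:2, each = 6))

cols <- c("values", "other")

# frollmean() takes a list of columns and returns a list of rolling
# means; shift() then lags each result by one observation.
test[, paste0("ravg_", cols) := shift(frollmean(.SD, n = 2), 1L),
     .SDcols = cols, by = category]
```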

I think you could speed this up by putting shift in the function you are using to calculate the average, e.g.

mav_shift <- function(x,n) if(length(x) >= n)stats::filter(shift(x),rep(1/n,n), sides=1) else NA_real_

By my quick test, this very slightly increases the time to run the function, and removes the step of creating a new variable. Please test to make sure it works as expected, but the results from your sample data appear the same.

EDIT and quicker solution:

mav_shift <- function(x, n) {
  if (length(x) >= n) {
    stats::filter(shift(x), rep(1/n, n), sides = 1)
  } else NA_real_
}

result <- by(test$values, test$category, mav_shift, n = 2, simplify = TRUE)
test$new <- as.vector(unlist(result))
E. Brown
  • Thanks for your suggestion ! Yes, it's faster. It now takes 43 secs to calculate the moving average of a variable, whereas it took about a minute before just to create the new variable. The moving average itself (without making the shift() modification) took 38 secs. Unfortunately it's still about 10 hours of total data preprocessing to run that for all variables in all data sets. Would you have an idea on making it faster ? – JBR Dec 29 '18 at 15:22
  • You could try this `by(test$values, test$category, mav_shift, n=2, simplify=T)` – E. Brown Dec 29 '18 at 15:59
  • It looks awesome thanks ! But how do I create a new variable out of this in the test data table ? – JBR Dec 29 '18 at 16:32
  • `result <- by(test$values, test$category, mav_shift, n=2, simplify=T)` `test$new <- as.vector(unlist(result))` – E. Brown Dec 29 '18 at 18:44
  • `by` is terribly slow – jangorecki Dec 31 '18 at 07:16
  • Yes this answer takes about 66% of the time of the original, but the answer by @lbusett is much faster. – E. Brown Dec 31 '18 at 14:28

You can combine the functions shift and rollmeanr, from the data.table and zoo packages respectively, as follows.

library(data.table)
library(zoo)
test = data.table(values = 1:12, category = rep(1:2, each = 6))
test[, mg2 := shift(rollmeanr(values, 2, fill = NA)), category]

    values category  mg2
 1:      1        1   NA
 2:      2        1   NA
 3:      3        1  1.5
 4:      4        1  2.5
 5:      5        1  3.5
 6:      6        1  4.5
 7:      7        2   NA
 8:      8        2   NA
 9:      9        2  7.5
10:     10        2  8.5
11:     11        2  9.5
12:     12        2 10.5
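As for the type error reported in the question's edit: it is consistent with a group that has fewer observations than the window returning a logical NA (or an empty result), which fixes the new column's type before other groups return doubles. One workaround, per the comments, is to drop the short groups first. A sketch, where the single-row category is artificial, just to reproduce the situation:

```r
library(data.table)
library(zoo)

test <- data.table(values = as.numeric(c(1, 2, 3, 4)),
                   category = c(1, 1, 1, 2))

# Keep only groups with at least as many rows as the window width.
test <- test[, if (.N >= 2) .SD, by = category]

test[, mg2 := shift(rollmeanr(values, 2, fill = NA)), by = category]
```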