0

I have a data table named features with the columns nightNo, HR, motion and angle. I'd like to get the rolling variance of the previous 600 points of the HR, motion and angle per nightNo. I've come up with the following function to do this:

features <- data.table(nightNo=c(1,1,1,1,1,1,1,2,2,2,2,2,2,2),
                       HR=c(1,2,3,4,5,6,7,8,9,10,11,12,13,14),
                       motion=c(14,13,12,11,10,9,8,7,6,5,4,3,2,1),
                       angle=c(2,4,6,8,10,12,14,16,18,20,22,24,26,28))

# For the example I'll use a window of 6 instead of 600
window = 6
features[, c("HR_Variance", "motion_Variance", "angle_Variance") := 
       list(rollapply(HR, window, var, partial=TRUE, align = "right"), 
            rollapply(motion, window, var, partial=TRUE, align = "right"), 
            rollapply(angle, window, var, partial=TRUE, align = "right")), by=nightNo ]

#    nightNo HR motion angle HR_Variance motion_Variance angle_Variance
# 1:       1  1     14     2          NA              NA             NA
# 2:       1  2     13     4    0.500000        0.500000       2.000000
# 3:       1  3     12     6    1.000000        1.000000       4.000000
# 4:       1  4     11     8    1.666667        1.666667       6.666667
# 5:       1  5     10    10    2.500000        2.500000      10.000000
# 6:       1  6      9    12    3.500000        3.500000      14.000000
# 7:       1  7      8    14    3.500000        3.500000      14.000000
# 8:       2  8      7    16          NA              NA             NA
# 9:       2  9      6    18    0.500000        0.500000       2.000000
# 10:      2 10      5    20    1.000000        1.000000       4.000000
# 11:      2 11      4    22    1.666667        1.666667       6.666667
# 12:      2 12      3    24    2.500000        2.500000      10.000000
# 13:      2 13      2    26    3.500000        3.500000      14.000000
# 14:      2 14      1    28    3.500000        3.500000      14.000000

The result is correct, but since I have a large dataset it runs forever. I've also made other similair features that use runmeans and sapplys over the same 600 window per nightNo and they run in a reasonable time, which makes me think either rollapply or the variance function is very slow. Is there a way to make this code more efficient, possibly by changing the var or the rollapply function?

Henk
  • 73
  • 8
  • 1
    maybe use `RcppRoll::roll_var` and also check out https://github.com/Rdatatable/data.table/issues/2778 – chinsoon12 Jun 28 '18 at 09:02
  • your expected output does'nt match the given data – Roman Jun 28 '18 at 15:54
  • @chinsoon12 Looks like roll_var sadly doesn't have parital implemented yet which I do require. – Henk Jun 28 '18 at 19:26
  • @Jimbou I missed the 6 in angle, thanks for the headsup. – Henk Jun 28 '18 at 19:26
  • 1
    You can always append NAs in the front while calling roll_var – chinsoon12 Jun 28 '18 at 21:48
  • @chinsoon12 appending the NAs and then using the rollapply on the small amount of NAs seems to do the trick in a reasonable time, since the roll_var seems te be over 10 times faster than the rollapply. Thanks! – Henk Jun 30 '18 at 09:50

1 Answers1

0

I have no idea what rollaplly is doing but I produce this output on the given sample data using a parallel tidyverse which could be faster

library(cumstats)
library(tidyverse)
library(furrr)

plan(multiprocess)
window <- 6

features %>% 
  nest(-nightNo) %>% 
  mutate(data=future_map(data,~mutate_at(.,vars(HR, motion,angle), 
                funs(var=cumvar(.)[c(1:window,rep(window,length(.)-length(1:window)))])))) %>% 
  unnest()
# A tibble: 14 x 7
   nightNo    HR motion angle HR_var motion_var angle_var
     <dbl> <dbl>  <dbl> <dbl>  <dbl>      <dbl>     <dbl>
 1       1     1     14     2  NA         NA        NA   
 2       1     2     13     4   0.5        0.5       2   
 3       1     3     12     6   1          1         4   
 4       1     4     11     8   1.67       1.67      6.67
 5       1     5     10    10   2.5        2.5      10   
 6       1     6      9    12   3.5        3.5      14   
 7       1     7      8    14   3.5        3.5      14   
 8       2     8      7    16  NA         NA        NA   
 9       2     9      6    18   0.5        0.5       2   
10       2    10      5    20   1          1         4   
11       2    11      4    22   1.67       1.67      6.67
12       2    12      3    24   2.5        2.5      10   
13       2    13      2    26   3.5        3.5      14   
14       2    14      1    28   3.5        3.5      14 
Roman
  • 17,008
  • 3
  • 36
  • 49
  • This seems to be slightly faster on a window of 6, but on a window of 600 it gives the error "Error in mutate_impl(.data, dots) : Evaluation error: invalid 'times' argument.". I'm not familiar with any of the libraries and functions used, any idea what could be causing this? – Henk Jun 28 '18 at 19:14
  • @Henk Has each `nightNo` 600 entries? – Roman Jun 29 '18 at 07:32
  • No, most have have and a couple have less. – Henk Jun 29 '18 at 12:08
  • 1
    OK. If the window size is longer then the entries then the error is produced. You have to set a windows size smaller than the minimal number of entries per `nightNo` using this approach. I can fix that, but you have to provide some sample data illustrating that problem. – Roman Jun 29 '18 at 13:22
  • I used the roll_var and appended NA's as chinsoon12 suggested, and then used my original rollapply on the NA's. The roll_var is over 10 times faster on a sample with with a window of 6 and seems to do the whole data with a 600 window in a reasonable so I think I'll go with that. Thanks for the reply and offering to fix the error. – Henk Jun 30 '18 at 09:48