
Does anyone have an idea or suggestion on how to increase the efficiency of the following code, which eats up all my RAM using a "kind-of" double rolling window?

I first go through a simple example defining the problem; the full MWE (implementation) is at the bottom of this post.


Consider the following "random" test vector (usually of length >25000):

A <- c(1.23,5.44,6.3,8.45,NaN,3.663,2.63,1.32,6.623,234.6,252.36)

A is sectioned into a "kind-of" train and test set, both with rolling windows. In this MWE, a train set of length 4 and a test set of length 2 are considered (usually both of length >200). So initially, the following values are part of the train and test sets:

train_1 <- A[1:4]
test_1 <- A[5:6]

Next, I want to subtract test_1 from train_1 at each possible consecutive position within train_1 (hence the first rolling window), generating the run_1_sub matrix.

run_1_sub <- matrix(NaN,3,2)
run_1_sub[1,] <- train_1[1:2] - test_1
run_1_sub[2,] <- train_1[2:3] - test_1
run_1_sub[3,] <- train_1[3:4] - test_1

Afterwards, for each row of run_1_sub, I want the sum of that row divided by the number of non-NaN entries in the row:

run_1_sum <- sapply(1:3, function(x) {
  sum(run_1_sub[x, ], na.rm = TRUE) / sum(!is.na(run_1_sub[x, ]))
})
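
Because of the NaN at A[5], the first column of run_1_sub is entirely NaN here, so each row mean ends up being taken over a single value:

run_1_sum
# [1] 1.777 2.637 4.787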

In the next step, the "kind-of" train and test sets are updated by shifting both windows one position forward in A (hence the second rolling window):

train_2 <- A[2:5] 
test_2 <- A[6:7]  

As previously, test_2 is subtracted at each possible position in train_2, and run_2_sub and run_2_sum are computed (a sketch follows below). This procedure is continued until the test set covers the last two values of A, so that (in this MWE) I end up with 6 run_sum results. My implementation, however, is very slow, and I was wondering whether anyone could help me increase its efficiency?
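
For illustration, the second subtraction step mirrors the first:

run_2_sub <- matrix(NaN, 3, 2)
run_2_sub[1, ] <- train_2[1:2] - test_2
run_2_sub[2, ] <- train_2[2:3] - test_2
run_2_sub[3, ] <- train_2[3:4] - test_2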


Here's my implementation:

# Initialization
library(zoo) 
A <- c(1.23, 5.44, 6.3, 8.45, NaN, 3.663, 2.63, 1.32, 6.623, 234.6, 252.36) # test vector
train.length <- 4
test.length <- 2
run.length <- length(A) - train.length - test.length + 1
# Form test sets
test.sets <- sapply(1:run.length, function(x) {
  A[(train.length + x):(train.length + test.length + x - 1)]
})
# Generate run_sub_matrices
run_matrix <- lapply(1:run.length, function(x) {
  rollapply(A[x:(train.length + x - 1)], width = test.length, by = 1,
            FUN = function(y) y - test.sets[, x])
})
# Generate run_sum_matrices
run_sum <- sapply(seq_along(run_matrix), function(x) {
  rowSums(run_matrix[[x]], na.rm = TRUE) /
    apply(run_matrix[[x]], 1, function(y) sum(!is.na(y)))
})

Naturally, the following initialization significantly slows down the generation of run_sub and run_sum:

A <- runif(25000)*400
train.length <- 400
test.length <- 200

Here, the elapsed times are 120.04s for generating run_sub and 28.69s for run_sum.

Any suggestions on how to improve the speed of this code?

Jonas

2 Answers

Usually the first two steps of code optimization in R are:

  • Do less;
  • Use vectorization.

We will go through both of these steps. Let's denote the input vector by x (A in your example).

The key functional unit in your problem can be formulated as follows: given train_start (the start index of a subset of the train set; we will use the word 'train' for this subset), test_start (the start index of the test set) and test_length (the length of the test set), compute:

train_inds <- train_start + 0:(test_length-1)
test_inds <- test_start + 0:(test_length-1)
run_diff <- x[train_inds] - x[test_inds]
sum(run_diff, na.rm = TRUE) / sum(!is.na(run_diff))

This unit is invoked many times, and so are the computations of the sums and !is.na. We will do less: instead of repeatedly computing differences and their sums, we precompute cumulative sums once and reuse them. See 'Preparatory computations' in run_mean_diff.
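
As a toy illustration of this cumulative-sum trick (the values here are arbitrary):

x <- c(1, 2, 3, 4, 5)
cs <- c(0, cumsum(x))    # 0 1 3 6 10 15
# The sum of any consecutive slice x[i:j] is cs[j + 1] - cs[i]:
cs[4 + 1] - cs[2]        # 9, the same as sum(x[2:4])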

res now contains the needed sums of differences of x_mod (a copy of x with 0 in place of NAs and NaNs). We should now subtract all overused elements, i.e. those which shouldn't contribute to the sums because the corresponding element in the other set is NA or NaN. While computing this information we also compute the denominator. See 'Info about extra elements' in run_mean_diff.

The beauty of this code is that train_start, test_start and test_length can now be vectors: the ith element of each vector is treated as a single instance of our task. This is vectorization. Our job now is to construct these vectors to suit our task. See the function generate_run_data.

The presented code uses much less RAM, doesn't need the extra zoo dependency, and is considerably faster than the original for small train_length and test_length. For big *_lengths it is also faster, but not by much.

One of the next steps might be writing this code using Rcpp.

The code:

run_mean_diff <- function(x, train_start, test_start, test_length) {
  # Preparatory computations
  x_isna <- is.na(x)
  x_mod <- ifelse(x_isna, 0, x)
  x_cumsum <- c(0, cumsum(x_mod))

  res <- x_cumsum[train_start + test_length] - x_cumsum[train_start] -
    (x_cumsum[test_start + test_length] - x_cumsum[test_start])

  # Info about extra elements
  extra <- mapply(
    function(cur_train_start, cur_test_start, cur_test_length) {
      train_inds <- cur_train_start + 0:(cur_test_length-1)
      test_inds <- cur_test_start + 0:(cur_test_length-1)

      train_isna <- x_isna[train_inds]
      test_isna <- x_isna[test_inds]

      c(
        # Correction for extra elements
        sum(x_mod[train_inds][test_isna]) -
              sum(x_mod[test_inds][train_isna]),
        # Number of extra elements
        sum(train_isna | test_isna)
      )
    },
    train_start, test_start, test_length, SIMPLIFY = TRUE
  )

  (res - extra[1, ]) / (test_length - extra[2, ])
}

generate_run_data <- function(n, train_length, test_length) {
  run_length <- n - train_length - test_length + 1
  num_per_run <- train_length - test_length + 1

  train_start <- rep(1:num_per_run, run_length) +
    rep(0:(run_length - 1), each = num_per_run)
  test_start <- rep((train_length + 1):(n - test_length + 1),
                    each = num_per_run)

  data.frame(train_start = train_start,
             test_start = test_start,
             test_length = rep(test_length, length(train_start)))
}

A <- c(1.23, 5.44, 6.3, 8.45, NaN, 3.663,
       2.63, 1.32, 6.623, 234.6, 252.36)
train_length <- 4
test_length <- 2
run_data <- generate_run_data(length(A), train_length, test_length)

run_sum_new <- matrix(
  run_mean_diff(A, run_data$train_start, run_data$test_start,
                run_data$test_length),
  nrow = train_length - test_length + 1
)
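
For the MWE above, run_data enumerates one row per (train window, test window) pair; its first rows look like this:

head(run_data, 4)
#   train_start test_start test_length
# 1           1          5           2
# 2           2          5           2
# 3           3          5           2
# 4           2          6           2
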
echasnovski
  • This is very nice, not only because it is clearly an improvement in terms of RAM usage (and processing time), but also because of your nice documentation! Thanks also for showing me the use of `mapply`, which was what bugged me for a short period. I've not used `C++` before, so until now, this is good enough for me. – Jonas Jun 12 '17 at 08:39
  • Simply out of curiosity - how did you end up generating the `res` formula based on using x_cumsum? – Jonas Jun 13 '17 at 14:11
  • By making transformations: `(train1 - test1) + (train2 - test2) = (train1 + train2) - (test1 + test2)` (for `test_length == 2`). The first sum is computed as the difference of `x_cumsum` elements at the end and start of `train`; the second, at the end and start of `test`. It is a common trick: if you need to compute sums of consecutive elements many times, it is better to precompute cumulative sums and use them. – echasnovski Jun 13 '17 at 17:31

The reason your code uses so much RAM is that you keep a lot of intermediate objects, mainly all the elements in run_matrix. And profiling via Rprof shows that most of the time is spent in rollapply.
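
For reference, a minimal way to run that profiling yourself (the file name here is arbitrary):

Rprof("rollapply_prof.out")
# ... run the original code from the question ...
Rprof(NULL)
head(summaryRprof("rollapply_prof.out")$by.total)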

The easiest and simplest way to avoid all the intermediate objects is to use a for loop. It also makes the code clearer. Then you just need to replace the call to rollapply with something faster.

The function you want to apply to each rolling subset is simple: subtract the test set. You can use the stats::embed function to create the matrix of lags, and then take advantage of R's recycling rules to subtract the test vector from each column. The function I created is:

calc_run_sum <- function(A, train_length, test_length) {
  run_length <- length(A) - train_length - test_length + 1L
  window_size <- train_length - test_length + 1L

  # Essentially what embed() does, but with column order reversed
  # (part of my adaptation of echasnovski's correction)
  train_lags <- 1L:test_length +
                rep.int(1L:window_size, rep.int(test_length, window_size)) - 1L
  dims <- c(test_length, window_size)  # lag matrix dims are always the same

  # pre-allocate result matrix
  run_sum <- matrix(NA, window_size, run_length)

  # loop over each run length
  for (i in seq_len(run_length)) {
    # test set indices and vector
    test_beg <- (train_length + i)
    test_end <- (train_length + test_length + i - 1)

    # echasnovski's correction
    #test_set <- rep(test_set, each = train_length - test_length + 1)
    #lag_matrix <- embed(A[i:(test_beg - 1)], test_length)
    #run_sum[,i] <- rowMeans(lag_matrix - test_set, na.rm = TRUE)

    # My adaptation of echasnovski's correction
    # (requires train_lags object created outside the loop)
    test_set <- A[test_beg:test_end]
    train_set <- A[i:(test_beg - 1L)]
    lag_matrix <- train_set[train_lags]
    dim(lag_matrix) <- dims
    run_sum[,i] <- colMeans(lag_matrix - test_set, na.rm = TRUE)
  }
  run_sum
}
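
For reference, this is how stats::embed lays out lags, with columns running from most recent to oldest (hence the reversed column order mentioned in the code comments above):

embed(1:5, 2)
#      [,1] [,2]
# [1,]    2    1
# [2,]    3    2
# [3,]    4    3
# [4,]    5    4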

Now, for some benchmarks. I used the following input data:

library(zoo) 
set.seed(21)
A <- runif(10000)*200
train.length <- 200
test.length <- 100

Here are the timings for your original approach:

system.time({
  run.length <- length(A) - train.length - test.length + 1
  # Form test sets
  test.sets <- sapply(1:run.length, function(x) {
    A[(train.length + x):(train.length + test.length + x - 1)]
  })
  # Generate run_sub_matrices
  run_matrix <- lapply(1:run.length, function(x) {
    rm <- rollapply(A[x:(train.length + x - 1)], width = test.length, by = 1,
                    FUN = function(y) { y - test.sets[, x] })
  })
  # Generate run_sum_matrices
  run_sum <- sapply(run_matrix, function(x) {
    rowSums(x, na.rm = TRUE) /
      apply(x, 1, function(y) sum(!is.na(y)))
  })
})
#    user  system elapsed 
#  19.868   0.104  19.974 

And here are the timings for echasnovski's approach:

system.time({
  run_data <- generate_run_data(length(A), train.length, test.length)

  run_sum_new <- matrix(
    run_mean_diff(A, run_data$train_start, run_data$test_start,
                  run_data$test_length),
    nrow = train.length - test.length + 1
  )
})
#    user  system elapsed 
#  10.552   0.048  10.602 

And the timings from my approach:

system.time(run_sum_jmu <- calc_run_sum(A, train.length, test.length))
#    user  system elapsed 
#   1.544   0.000   1.548 

The outputs from all 3 approaches are identical.

identical(run_sum, run_sum_new)
# [1] TRUE
identical(run_sum, run_sum_jmu)
# [1] TRUE
Joshua Ulrich
  • Indeed, it is definitely speeding up the process, and the code is also easier to interpret. Thank you very much. – Jonas Jun 12 '17 at 14:30
  • First of all, really nice answer. – echasnovski Jun 12 '17 at 19:48
  • Second: I think there is a misguided use of data structures here. For example, I tried `calc_run_sum(A, 20, 3)` and it gave me an error about mismatching lengths. There seem to be two problems: 1. The initialization of `run_sum` should be `run_sum <- matrix(NA, train_length - test_length + 1, run_length)` (a different number of rows); – echasnovski Jun 12 '17 at 20:07
  • 2. In the initialization of `test_set`, `test_set <- rep(test_set, each = train_length - test_length + 1)` should be added, because subtraction of a vector from a matrix is done by columns and `stats::embed` creates 'lags' in columns. After these two fixes, your answer seems to work properly. – echasnovski Jun 12 '17 at 20:09
  • @echasnovski: thanks a ton for the corrections! I'd upvote your answer again, if I could. My adaptation of your correction is also faster than my incorrect solution (runs in ~1s on the same machine). – Joshua Ulrich Jun 12 '17 at 22:43
  • Yes indeed, now it seems to work correctly and is even faster. I've also noticed a missing correction about matrix column order in `embed`, but it's good that you've already fixed it :) – echasnovski Jun 13 '17 at 05:28
  • @echasnovski and @JoshuaUlrich, thanks for your contributions. As a side note: I previously had to flip the column order of the `lag_matrix` when subtracting `test_set` so that I could also compute the absolute difference; however, this new solution makes it possible to skip that step and simply insert `abs()` into the `colMeans` operation. – Jonas Jun 13 '17 at 06:42