Does anyone have an idea or suggestion on how to increase the efficiency of the following example of code eating up all my ram using a "kind-of" double rolling window?
First, I go through a simple example defining the problem, with a full MWE (implementation) at the bottom of this post.
First, consider the following "random" test vector (usually of length >25000):
A <- c(1.23,5.44,6.3,8.45,NaN,3.663,2.63,1.32,6.623,234.6,252.36)
A
is sectioned into a "kind-of" train and test set, both with rolling windows. In this MWE, a train-set start of length 4
and a test set length of 2
are considered (usually of length >200). So initially, the following values are part of the train and test set:
train_1 <- A[1:4]
test_1 <- A[5:6]
Next, I want to subtract test_1
from train_1
at each possible consecutive location of train_1
(hence the first rolling window), generating the run_1_sub
matrix.
run_1_sub <- matrix(NaN,3,2)
run_1_sub[1,] <- train_1[1:2] - test_1
run_1_sub[2,] <- train_1[2:3] - test_1
run_1_sub[3,] <- train_1[3:4] - test_1
Afterwards, I want to find on each row in run_1_sub
the sum of each row divided by the number of entries in each row not being NaN
.
run_1_sum <-
sapply(1:3, function(x) {
sum(run_1_sub[x,], na.rm = T) / sum(!is.na(run_1_sub[x,]))
})
In the next step, the "kind-of" train and test sets are updated by increasing their order from A
by one (hence the second rolling window):
train_2 <- A[2:5]
test_2 <- A[6:7]
As previously, test_2
is subtracted at each possible location in train_2
and run_2_sub
and run_2_sum
are computed. This procedure is continued until the test set represents the last two values of A and finally I end (in this MWE) up with 6 run_sum
matrices. My implementation, however, is very slow, and I was wondering whether anyone could help me to increase it's efficiency?
Here's my implementation:
# Initialization
library(zoo)
#rm(list = ls())
A <- c(1.23, 5.44, 6.3, 8.45, NaN, 3.663, 2.63, 1.32, 6.623, 234.6, 252.36) # test vector
train.length <- 4
test.length <- 2
run.length <- length(A) - train.length - test.length + 1
# Form test sets
test.sets <- sapply(1:run.length, function(x) {
A[(train.length + x):(train.length + test.length + x - 1)]
})
# Generate run_sub_matrices
run_matrix <- lapply(1:run.length, function(x) {
rollapply(A[x:(train.length + x - 1)], width = test.length, by = 1,
function(y) {
y - test.sets[, x]
})
})
# Genereate run_sum_matrices
run_sum <- sapply(1:length(run_matrix), function(x) {
rowSums(run_matrix[[x]], na.rm = T) / apply(run_matrix[[x]], 1, function(y) {
sum(!is.na(y))})
})
Naturally, the following initialization set-up slows the generation of run_sum
and run_sub
significantly down:
A <- runif(25000)*400
train.length <- 400
test.length <- 200
Here, the elapsed time for generating run_sub
is 120.04s and for run_sum
28.69s respectively.
Any suggestions on how to increase and improved the speed and code?