
I'm looking for help optimizing my code to get rid of loops and increase computational speed. I am pretty new to the field and to R. I run component-wise gradient boosting regressions on a linear time series model with a rolling window. I use the coefficients from the regression of y on X for each window to predict the next "out of window" observation of y. (Later I will evaluate forecast accuracy.)

My data are 1560 different time series (including lags of the original series) with about 540 observations each (a data frame of dimension 540x1560).

I looked into rollapply but couldn't get it to work. In particular, I don't know how to predict yhat for each window (each iteration).

#Initialize list for predictions
ls_yhat <- list()

#Define window size
w <- 100

##Starting loop, rolling the window by one observation per iteration
#Predicting the next dependent variable y_hat(w+i) with the data from the "pseudo" most recent observation
for (i in 1:(nrow(df_all) - w)) {
  glm1 <- glmboost(fm, data = df_all[i:(w - 1 + i), ], center = TRUE,
                   control = boost_control(mstop = 100, trace = TRUE))
  ls_yhat[[i]] <- predict(glm1, newdata = df_all[w - 1 + i, ])
}
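For reference, the same rolling scheme can be written with lapply instead of an explicit for loop (a sketch, assuming df_all, fm, and w are defined as above; note that the *apply functions are still loops internally, so this mainly tidies the code rather than speeding it up):

```r
## Rolling one-step-ahead predictions via lapply (sketch, not a speed-up)
ls_yhat <- lapply(seq_len(nrow(df_all) - w), function(i) {
  glm1 <- glmboost(fm, data = df_all[i:(w - 1 + i), ], center = TRUE,
                   control = boost_control(mstop = 100, trace = FALSE))
  predict(glm1, newdata = df_all[w - 1 + i, ])  ## last in-window row, as in the loop
})
```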

Any tips appreciated (it takes forever to run on my laptop)!

PS: I am also looking into using the multicore or parallel packages, especially because I'll use cross-validation for the stopping criterion later on. But I have just started looking into it. However, any tips on that are appreciated too!
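For the cross-validated stopping criterion, mboost ships cvrisk(), which can itself run its folds in parallel via its papply argument (a sketch, assuming glm1 is a fitted glmboost model; cv(), model.weights(), and mstop() are part of the mboost interface):

```r
library("mboost")
library("parallel")

## Build a fold matrix from the model weights and cross-validate mstop;
## papply = mclapply runs the folds in parallel (forking, so not on Windows)
cvm <- cvrisk(glm1,
              folds = cv(model.weights(glm1), type = "kfold"),
              papply = mclapply)
mstop(cvm)        ## optimal number of boosting iterations
glm1[mstop(cvm)]  ## set the model to that mstop
```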

Edit: Minimal working example using built-in data (not time series, though):

library("mboost") ## load package
data("bodyfat", package = "TH.data") ## load data

##Initialize list for predictions
ls_yhat <- list()

#Define window size
w <- 30

##Starting loop, rolling the window by one observation per iteration
##Predicting the next dependent variable y_hat(w+i) with the data from the "pseudo" most recent observation
for (i in 1:(nrow(bodyfat) - w)) {
  glm1 <- glmboost(DEXfat ~ ., data = bodyfat[i:(w - 1 + i), ], center = TRUE,
                   control = boost_control(mstop = 15, trace = TRUE))
  ls_yhat[[i]] <- predict(glm1, newdata = bodyfat[(w - 1 + i), ])
}
  • Not really sure if it's possible, but have you looked at apply()? – Dinesh.hmn Sep 09 '16 at 13:11
    It would be really good if you could either provide the data via `dput` or use a builtin data set for your example so that we can reproduce this. – Hack-R Sep 09 '16 at 13:12
    Just fyi, your "out of window" observation is actually within the training dataset for each i iteration. Your current code is most likely mis-specified. All the apply family is just another form of loop. `glmboost` is likely the bottleneck since your dataset dimension is actually minuscule in the grand scheme of things. You can parallelize your loop with `package:foreach` or `parallel::mclapply`. – Vlo Sep 09 '16 at 14:10
  • @Vlo Right, to predict the first observation after the window, I use the estimated parameters and the last observation from the window (on which the model was trained). But I think that should be all right. Or am I missing something? – SimonCW Sep 09 '16 at 14:27
  • @Vlo: Concerning the speed, you are right, it's only about 400 iterations max and apply wouldn't save me that much time. Still, I'd like to learn how to use it correctly. Thanks for recommending the packages! I will certainly look into them. Parallelizing the loop would probably save a lot of time. – SimonCW Sep 09 '16 at 14:28
  • If that is your prediction scheme, it wouldn't be called "out of window". I'm not saying that apply will/will not result in a speed up. I'm saying the `*apply` family is effectively a for loop in R. – Vlo Sep 09 '16 at 14:50

1 Answer


As Vlo rightly mentioned, the bottleneck is the boosting algorithm. I used package:foreach and doParallel, which more than halved the running time. I want to share my solution.

library("mboost") ## load package
data("bodyfat", package = "TH.data") ## load data
library("foreach")
library("doParallel")

##Register backend for parallel execution
registerDoParallel()

##With foreach, the result list is assembled from the loop body's return values,
##so ls_yhat does not need to be initialized beforehand

#Define window size
w <- 30

##Starting loop, rolling the window by one observation per iteration
##Predicting the next dependent variable y_hat(w+i) with the data from the "pseudo" most recent observation
ls_yhat <- foreach(i = 1:(nrow(bodyfat) - w), .packages = 'mboost') %dopar% {
  glm1 <- glmboost(DEXfat ~ ., data = bodyfat[i:(w - 1 + i), ], center = TRUE,
                   control = boost_control(mstop = 15, trace = FALSE))
  predict(glm1, newdata = bodyfat[(w - 1 + i), ])  ## last expression is the value foreach collects
}
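An alternative Vlo mentioned is parallel::mclapply, which needs no backend registration (a sketch; forking via mc.cores does not work on Windows, where mc.cores must stay 1):

```r
library("mboost")
library("parallel")
data("bodyfat", package = "TH.data")

w <- 30
## Each list element is the one-step-ahead prediction for window i,
## computed on a forked worker
ls_yhat <- mclapply(seq_len(nrow(bodyfat) - w), function(i) {
  glm1 <- glmboost(DEXfat ~ ., data = bodyfat[i:(w - 1 + i), ], center = TRUE,
                   control = boost_control(mstop = 15, trace = FALSE))
  predict(glm1, newdata = bodyfat[(w - 1 + i), ])
}, mc.cores = max(1, detectCores() - 1))
```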