
First some background info, which is probably more interesting on stats.stackexchange:

In my data analysis I am comparing the performance of different machine learning methods on time series data (regression, not classification). For example, I have trained a boosting model and compare it with a random forest model (R package randomForest).

I use time series data where the explanatory variables are lagged values of other series and of the dependent variable itself.

For some reason the random forest severely underperforms. One problem I can think of is that the random forest resamples the training data for each tree. If it does this to time series data, the autoregressive nature of the series is completely destroyed.

To test this idea, I would like to replace the (bootstrap) sampling step in the randomForest() function with a so-called block-wise bootstrap step. This basically means I cut the training set into k contiguous blocks, where k << N, each block keeping its observations in the original order. If I resample these k blocks (as in the sketch below), I can still benefit from the 'randomness' in the random forest while leaving the time series structure largely intact.
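A minimal sketch of the index resampling I have in mind (the helper name is made up for illustration, it is not part of randomForest):

#sketch of a block-wise bootstrap of row indices (hypothetical helper):
#cut 1:N into k contiguous blocks, then resample whole blocks with replacement
block_bootstrap_idx <- function(N, k) {
  blocks <- split(1:N, cut(1:N, breaks = k, labels = FALSE))    #k contiguous blocks
  unlist(sample(blocks, k, replace = TRUE), use.names = FALSE)  #resample blocks
}
set.seed(1)
idx <- block_bootstrap_idx(100, 10)  #row indices for one block-wise bootstrap sample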

Now my problem is this:

To achieve this I would normally copy the existing function and edit the desired step/lines.

randomForest2 <- randomForest  #copy the function object (no parentheses), then edit randomForest2

But the randomForest() function turns out to be a wrapper around another wrapper around deeper underlying functions. So how can I edit the actual bootstrap step in randomForest() and still run the rest of the function as usual?
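For reference, this is roughly how the wrapper chain can be inspected with standard R introspection tools (a sketch; the exact output depends on the installed version):

#inspect the S3 dispatch chain of randomForest
library(randomForest)
methods(randomForest)                #lists the methods, e.g. randomForest.default, randomForest.formula
getAnywhere("randomForest.default")  #prints the R source of the workhorse method
#body(randomForest:::randomForest.default)  #or just look at the function body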

DaReal
  • you may want to go grab the original function and start from there - something like `randomForest <- randomForest:::randomForest.formula`. How big is the data? Maybe you could remove the time series element by including the lagged values as independent variables, then using a standard approach? – RichAtMango Aug 17 '15 at 19:02
  • 1
    I'm not sure the `strata` argument will achieve what you want, probably not. Otherwise, the resampling all takes place in compiled C (regression) or Fortran (classification) code, so to modify that you'd need to download the source, alter it and recompile. – joran Aug 17 '15 at 19:49
  • What if you tried it with `replace=FALSE` and `sampsize` equal to your training data? That would get rid of the bootstrap entirely and you'd basically end up with a set of bagged trees. – Tchotchke Aug 17 '15 at 19:54
  • Thank you for your suggestions, I will check them out tomorrow and update accordingly. – DaReal Aug 18 '15 at 20:33

2 Answers


So for me the solution wasn't editing the existing randomForest function. Instead I coded the block-wise bootstrap myself, using the split2 function given by Soren H. Welling to create the blocks. Once I had my data block-wise bootstrapped, I used a package (rpart) that fits a single regression tree, and aggregated the trees' predictions myself (taking the row means).

On my actual data the result is a slight but consistent improvement over the normal random forest in terms of RMSPE.

For the simulated data below, the performance seems to be a coin toss.

Taking Soren's code as an example it looks a bit like this:

library(randomForest)
library(doParallel) #on Linux, the parallel package and mclapply would be better
library(rpart)

#parallel backend ftw
nCPU = detectCores()
cl = makeCluster(nCPU)
registerDoParallel(cl)

#simulated time series(y) with time roll and lag=1
timepoints=1000;var=6;noise.factor=.2

#past to present orientation    
y = sin((1:timepoints)*pi/30) * 1000 +
  sin((1:timepoints)*pi/40) * 1000 + 1:timepoints
y = y+rnorm(timepoints,sd=sd(y))*noise.factor
plot(y,type="l")

#convert to absolute change (first difference), with lag=1
#(y already has noise added above, so dy is noisy too)
dy = c(0,y[-1]-y[-length(y)]) # c(0,t2-t1,t3-t2,...)
dX = sapply(1:40,function(i){
  getTheseLags = (1:timepoints) - i
  getTheseLags[getTheseLags<1] = NA #remove before start timePoints
  dx.lag.i = dy[getTheseLags]
})
dX[is.na(dX)]=-100 #quick fix for lags that reach before the start of the series
pairs(data.frame(dy,dX[,1:5]),cex=.2)#data structure

#make train- and test-set
train=1:600
dy.train = dy[ train]
dy.test  = dy[-train]
dX.train  = dX[ train,]
dX.test   = dX[-train,]

#classic rf
rf = randomForest(dX.train,dy.train,ntree=500)
print(rf)

#like function split for a vector without mixing
split2 = function(aVector,splits=31) {
  lVector = length(aVector)
  mod = lVector %% splits
  lBlocks = rep(floor(lVector/splits),splits)
  if(mod!=0) lBlocks[1:mod] = lBlocks[1:mod] + 1
  lapply(1:splits,function(i) {
    Stop  = sum(lBlocks[1:i])
    Start = Stop - lBlocks[i] + 1
    aVector[Start:Stop]
  })
}  


#create a list of block-wise bootstrapped samples
aBlock <- list()
numTrees <- 500
splits <- 40
for (ttt in 1:numTrees){

  aBlock[[ttt]] <- unlist(
    sample(
      split2(1:nrow(dX.train),splits=splits),
      splits,
      replace=T
    )
  )
}

#put data into a dataframe so rpart understands it
df1 <- data.frame(dy.train, dX.train)
#perform regression trees for Blocks
rfBlocks = foreach(aBlock = aBlock,
                   .packages=("rpart")) %dopar% {
                     dBlock = df1[aBlock,] 
                     rf = predict( rpart( dy.train ~., data = dBlock, method ="anova" ), newdata=data.frame(dX.test) ) 
                   } 

#predict test, make results table
#use rowMeans to aggregate the block-wise predictions
results = data.frame(predBlock   = rowMeans(do.call(cbind.data.frame, rfBlocks)),
                     true=dy.test,
                     predBootstrap = predict(rf,newdata=dX.test)
                     )
plot(results[,1:2],xlab="predicted change (test set)",
     ylab="true change",
     main="black: block-wise rpart trees, blue: classic bootstrap RF")
points(results[,3:2],col="blue")

#prediction results
print(cor(results)^2)


stopCluster(cl)#close cluster
DaReal

To directly alter the sampling of randomForest (regression type): learn basic C programming, download the source package randomForest_4.6-10.tar.gz from CRAN (on Windows install Rtools, on OS X install Xcode), install and open RStudio, start a new project of type 'package', unpack the ...tar.gz into the folder, look into the src folder, open regrf.c, and check out lines 151 and 163. Write your new sampling strategy, occasionally press Ctrl+Shift+B to rebuild/compile and overwrite the installed randomForest library, correct the compile errors it reports, test now and then that the package still works, spend some hours figuring out the old uninformative code, perhaps change the DESCRIPTION file, the NAMESPACE file and a few other references so the package changes its name to randomForestMod, rebuild, voilà.
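If you go down that road, fetching and unpacking the source can also be done from within R; a rough sketch (the version number in the tarball name will differ with the current CRAN release):

#sketch: fetch and unpack the randomForest source package from CRAN
download.packages("randomForest", destdir = ".", type = "source")
tarball <- list.files(".", pattern = "^randomForest_.*\\.tar\\.gz$")[1]
untar(tarball)  #the regression resampling code is then in randomForest/src/regrf.c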

An easier way, which does not require changing randomForest, is described below. Trees with the same feature inputs can be patched together with the function randomForest::combine, so you can design your sampling regime in pure R code. I thought it was actually a bad idea, but for this very naive simulation it works with similar or slightly better performance! Remember not to predict the absolute target value, but instead a stationary derivative such as the relative change, absolute change, etc. If you predict the absolute value, the RF will fall back to predicting that tomorrow is something pretty close to today, which is trivial, useless information.


library(randomForest)
library(doParallel) #on Linux, the parallel package and mclapply would be better

#parallel backend ftw
nCPU = detectCores()
cl = makeCluster(nCPU)
registerDoParallel(cl)

#simulated time series(y) with time roll and lag=1
timepoints=1000;var=6;noise.factor=.2

#past to present orientation    
y = sin((1:timepoints)*pi/30) * 1000 +
    sin((1:timepoints)*pi/40) * 1000 + 1:timepoints
y = y+rnorm(timepoints,sd=sd(y))*noise.factor
plot(y,type="l")

#convert to absolute change (first difference), with lag=1
#(y already has noise added above, so dy is noisy too)
dy = c(0,y[-1]-y[-length(y)]) # c(0,t2-t1,t3-t2,...)
dX = sapply(1:40,function(i){
  getTheseLags = (1:timepoints) - i
  getTheseLags[getTheseLags<1] = NA #remove before start timePoints
  dx.lag.i = dy[getTheseLags]
})
dX[is.na(dX)]=-100 #quick fix for lags that reach before the start of the series
pairs(data.frame(dy,dX[,1:5]),cex=.2)#data structure

#make train- and test-set
train=1:600
dy.train = dy[ train]
dy.test  = dy[-train]
dX.train  = dX[ train,]
dX.test   = dX[-train,]

#classic rf
rf = randomForest(dX.train,dy.train,ntree=500)
print(rf)

#like function split for a vector without mixing
split2 = function(aVector,splits=31) {
  lVector = length(aVector)
  mod = lVector %% splits
  lBlocks = rep(floor(lVector/splits),splits)
  if(mod!=0) lBlocks[1:mod] = lBlocks[1:mod] + 1
  lapply(1:splits,function(i) {
    Stop  = sum(lBlocks[1:i])
    Start = Stop - lBlocks[i] + 1
    aVector[Start:Stop]
  })
}  

nBlocks=10 #combine does not support blocks of unequal size
rfBlocks = foreach(aBlock = split2(train,splits=nBlocks),
                   .combine=randomForest::combine,
                   .packages=("randomForest")) %dopar% {
                     dXblock = dX.train[aBlock,] ; dyblock = dy.train[aBlock]
                     rf = randomForest(x=dXblock,y=dyblock,sampsize=length(dyblock),
                                       replace=T,ntree=50)
                   }
print(rfBlocks)

#predict test, make results table
results = data.frame(predBlock   = predict(rfBlocks,newdata=dX.test),
                     true=dy.test,
                     predBootstrap = predict(rf,newdata=dX.test))
plot(results[,1:2],xlab="predicted change (test set)",
     ylab="true change",
     main="black: block-wise RF (combined), blue: classic bootstrap RF")
points(results[,3:2],col="blue")

#prediction results
print(cor(results)^2)


stopCluster(cl)#close cluster
Soren Havelund Welling
  • Thanks for both your answers, although the second sounded like the more feasible option. This is a different interpretation of the block-wise bootstrap, but I guess my description was not clear enough. The actual block-wise bootstrap divides the dataset into k parts and resamples the blocks as if they were the observations, so within a specific block the actual observations are still in their original order. – DaReal Aug 19 '15 at 20:24
  • Function split2 does that – Soren Havelund Welling Aug 19 '15 at 20:26
  • 600 train time points in 25 blocks: first block, first tree, first 24 time points. You can set replace=FALSE if you like. – Soren Havelund Welling Aug 19 '15 at 20:29
  • I accidentally posted by pressing enter, and you were quick to reply, so here's the rest of my comment: your implementation actually had merit as well, so I gave it a try. Unfortunately, for my data it performs slightly worse in terms of RMSPE. The reason I can think of is that my data is not divisible into nice even segments, so I have to cut off observations and thus include less information. Having read your reply, I will try again using replace = F and get back to you, thanks! Also: thanks for the parallel implementation, it saves me a third of the runtime. – DaReal Aug 19 '15 at 20:37
  • Ohh, I was pulling a block out of the sampling pool: `dXblock = dX.train[-aBlock,] ; dyblock = dy.train[-aBlock]`. I removed the minus, such that the sampling pool is the one block alone: `dXblock = dX.train[aBlock,] ; dyblock = dy.train[aBlock]` – Soren Havelund Welling Aug 19 '15 at 20:39
  • I will correct my answer for that. But anyway I don't think this approach will save your model big time. Good luck :) – Soren Havelund Welling Aug 19 '15 at 20:41
  • Thanks, my adapted version indeed did not save it big time, but it did slightly and consistently improve the RMSPE measure. – DaReal Aug 22 '15 at 18:41