
I'm struggling with a problem. I'm using SparkR for time series forecasting, but this scenario can also be transferred to a normal R environment. Instead of an ARIMA model I want to use regression models such as Random Forest regression to forecast the load one day ahead. I have also read about the sliding window approach for evaluating the performance of different regressors with respect to different parameter combinations. To give a better understanding, this is an example of the structure of my dataset:

Timestamp              UsageCPU     UsageMemory   Indicator  Delay
2014-01-03 21:50:00    3123            1231          1        123
2014-01-03 22:00:00    5123            2355          1        322
2014-01-03 22:10:00    3121            1233          2        321
2014-01-03 22:20:00    2111            1234          2        211
2014-01-03 22:30:00    1000            2222          2         0 
2014-01-03 22:40:00    4754            1599          1         0

To use any kind of regressor, the next step is to extract features and transform them into a readable format, because these regressors cannot read timestamps:

Year   Month  Day  Hour    Minute    UsageCPU   UsageMemory  Indicator Delay
2014   1      3    21       50        3123        1231          1      123
2014   1      3    22       00        5123        2355          1      322
2014   1      3    22       10        3121        1233          2      321
2014   1      3    22       20        2111        1234          2      211
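
The decomposition above can be sketched in plain R. This is a minimal, hypothetical example (the data frame `df` and its values are just stand-ins for the real dataset):

```r
# Minimal sketch: split a POSIXct timestamp into the numeric columns shown
# above, so the regressor only sees numeric features.
df <- data.frame(
  Timestamp = as.POSIXct(c("2014-01-03 21:50:00", "2014-01-03 22:00:00")),
  UsageCPU  = c(3123, 5123)
)
df$Year   <- as.integer(format(df$Timestamp, "%Y"))
df$Month  <- as.integer(format(df$Timestamp, "%m"))
df$Day    <- as.integer(format(df$Timestamp, "%d"))
df$Hour   <- as.integer(format(df$Timestamp, "%H"))
df$Minute <- as.integer(format(df$Timestamp, "%M"))
df$Timestamp <- NULL  # drop the raw timestamp before training
```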

The next step is to create training and test set for the model.

trainTest <-randomSplit(SparkDF,c(0.7,0.3), seed=42)
train <- trainTest[[1]]
test <- trainTest[[2]]

Then it is possible to create the model and the prediction (the exact settings of the random forest are not relevant at first):

model <- spark.randomForest(train, UsageCPU ~ ., type = "regression", maxDepth = 5, maxBins = 16)
predictions <- predict(model, test)

So I know all these steps, and by plotting the predicted data against the actual data it looks quite good. But this regression model is not dynamic, which means I cannot predict one day ahead, because features such as UsageCPU, UsageMemory etc. do not yet exist for the future; I want to predict the next day from historical values only. As mentioned in the beginning, the sliding window approach could work here, but I'm not sure how to apply it (on the whole dataset, only on the training set, or only on the test set).
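
For what it's worth, one common way to get a full day ahead from a one-step regressor is to forecast recursively, feeding each prediction back into the input window. A minimal base-R sketch (the names `recursive_forecast` and `predict_one` are my own, not SparkR API; `predict_one` stands in for any fitted one-step model):

```r
# Recursive multi-step forecast: predict one step, append it to the window,
# repeat. For 10-min data a full day is 144 steps.
recursive_forecast <- function(history, predict_one, steps = 144, windowsize = 144) {
  preds <- numeric(steps)
  buf <- tail(history, windowsize)     # most recent observed values
  for (s in seq_len(steps)) {
    preds[s] <- predict_one(buf)       # one-step-ahead prediction
    buf <- c(buf[-1], preds[s])        # slide the window forward by one step
  }
  preds
}

# toy check with a "model" that simply repeats the last value
p <- recursive_forecast(1:200, function(w) tail(w, 1), steps = 3, windowsize = 5)
# p == c(200, 200, 200)
```

The downside of this scheme is that prediction errors compound over the day; the alternative is to train 144 separate direct models, one per horizon.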

This implementation is from shabbychef and mbq:

 # Mean over sliding windows of length `windowsize`, moved forward `slide`
 # elements at a time; the cumulative sum makes each window mean O(1).
 slideMean <- function(x, windowsize = 3, slide = 2) {
   idx1 <- seq(1, length(x), by = slide)           # window start indices
   idx2 <- idx1 + windowsize                       # one past each window end
   idx2[idx2 > (length(x) + 1)] <- length(x) + 1   # clip the last window
   cx <- c(0, cumsum(x))
   (cx[idx2] - cx[idx1]) / windowsize
 }
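
A quick sanity check of slideMean on a toy vector (the definition is repeated here so the snippet runs on its own). Note that the final, truncated window is still divided by the full windowsize, which biases the last mean downward:

```r
slideMean <- function(x, windowsize = 3, slide = 2) {
  idx1 <- seq(1, length(x), by = slide)
  idx2 <- idx1 + windowsize
  idx2[idx2 > (length(x) + 1)] <- length(x) + 1
  cx <- c(0, cumsum(x))
  (cx[idx2] - cx[idx1]) / windowsize
}

slideMean(1:10)
# windows start at 1,3,5,7,9 -> means 2, 4, 6, 8, 6.33 (last window has only 9,10)
```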

The last question deals with the window size. I want to predict the next day in hours (00, 01, 02, 03, ...), but the timestamps have an interval of 10 min, so in my calculation the size of a window should be 144 (24 h * 60 min / 10 min).

It would be so nice if someone could help me. Thanks!

Daniel

1 Answer


I also had the same problem for time-series prediction using neural nets. I implemented many models, and the one that worked best was the sliding window combined with neural nets; other researchers in the field confirmed this as well. From this we concluded that if you want to predict 1 day ahead (24 horizons) in a single step, training will be demanding for the system. We proceeded as follows:

1. We had a sliding window of 24 hours; e.g. let's use [1,2,3] here.
2. Then use the ML model to predict [4], meaning use value 4 as the target.
   # As illustration we had
   x = [1,2,3]
   # then set the target as
   y = [4]
   # We had a function that returns x = [1,2,3] and y = [4] and
   # shifts the window in the next training step.
3. To the x = [1,2,3] we can add further features that are important to the model: x = [1,2,3, feature_x].
4. Then we minimise the error and shift the window to get x = [2,3,4, feature_x] and y = [5].
5. You could also predict two values ahead, e.g. [4,5].
6. Use a list to collect the outputs and plot them.
7. Make predictions after the training.
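
The steps above can be sketched as a small base-R helper (`make_windows` is a hypothetical name, not from any library): it turns a series into a matrix of lagged inputs plus a target vector, which any regressor can consume. Extra features would be appended as additional columns of X.

```r
# Convert a series into supervised (X, y) pairs: each row of X holds
# `windowsize` consecutive values, and y is the value `horizon` steps
# after the end of that window.
make_windows <- function(series, windowsize = 24, horizon = 1) {
  n <- length(series) - windowsize - horizon + 1
  X <- t(sapply(seq_len(n), function(i) series[i:(i + windowsize - 1)]))
  y <- series[seq_len(n) + windowsize + horizon - 1]
  list(X = X, y = y)
}

w <- make_windows(1:6, windowsize = 3, horizon = 1)
# w$X has rows [1,2,3], [2,3,4], [3,4,5]; w$y is c(4, 5, 6)
```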
smile
  • That's cool, thank you for your answer. Just some questions for correct interpretation: does that mean that in your case you had a dataset represented in hours from 1-24h, represented only in the variable x (not a column for every hour)? Would it be more precise if I went for minutes? My x value would then be x[15,30,45,60,...,1440] and the horizon also 1440. But I am not sure what you mean by y[4]. Is y my target value that I want to predict (UsageCPU)? And can I use the sliding window function from above and integrate it, or should I re-code it as you described? – Daniel May 05 '17 at 18:34
  • If your data size is 14440 or more: I chose a window of 24 (it contains 24 x values). Then for the first iteration I take the window of 24 and predict the 25th value; the 25th value will be my target. After this I shift my window, dropping the first value from the window while adding the 25th value, and I predict the 26th value. If you have x[15......1440] you can predict the 1441st value only. Then shift the window, i.e. drop 15, add 1441, and predict 1442. Like this you can predict many time steps ahead. – smile May 06 '17 at 09:40
  • And to make things clearer: it does not matter if you go for minutes or hours. The aim is that I define a window on the target (UsageCPU), e.g. UsageCPU = [1,2,3,4] as the window. Then for each iteration I get an input/output pair, like Usage[1] as x and Usage[5] as target. This means I am predicting 5 steps ahead. Then I shift the window: drop Usage[1], use Usage[2], then predict Usage[6]. Now to your x value in each iteration you add other information to assist the prediction, for example x = [1, hour, month, year, UsageMemory, delay] ---> – smile May 06 '17 at 10:00
  • Again, thank you so much for your detailed explanation! Hopefully it is easy to implement in R. But with your explanation I think I should be fine, otherwise I will ask a question on StackOverflow ;). – Daniel May 06 '17 at 15:22
  • http://machinelearningmastery.com/convert-time-series-supervised-learning-problem-python/ - I think that is exactly what you mean, right? – Daniel May 08 '17 at 18:40