Random Forest has a predict function where you provide data points for the independent variables that you trained on and get a predicted value for the dependent variable. My goal is to figure out how to train a random forest and predict with it using lagged variables.
I have a data set with the following independent variables:
Quarter, US_GDP, UK, Canada, MiddleEast, Africa
My dependent variable is Total_Oil_Production.
I have data from 2008Q1 to 2015Q4, and my goal is to predict oil production for future quarters, from 2016 onwards.
> head(oil.data)
Quarter US_GDP UK Canada MiddleEast Africa Total_Oil_Production
1 2008Q1 14685.3 77.22900 96.73333 0.06666667 7784.333 1290.3
2 2008Q2 14668.4 78.19967 98.36667 0.36666667 7988.200 1212.8
3 2008Q3 14813.0 78.29500 98.46667 0.13333333 8090.567 1302.0
4 2008Q4 14843.0 78.63800 97.56667 0.60000000 8120.800 1136.6
5 2009Q1 14549.9 78.47733 98.23333 0.30000000 8197.200 846.4
6 2009Q2 14383.9 79.22400 99.70000 0.40000000 8278.100 748.3
As you can see, I have no data for the quarters from 2016 onwards.
> tail(oil.data)
Quarter US_GDP UK Canada MiddleEast Africa Total_Oil_Production
31 2015Q3 17913.7 86.65300 115.7 -0.1 10985.20 1554.4
32 2015Q4 18060.2 86.85767 116.9 0.8 10933.03 1542.6
33 2016Q1 NA NA NA NA NA NA
34 2016Q2 NA NA NA NA NA NA
35 2016Q3 NA NA NA NA NA NA
36 2016Q4 NA NA NA NA NA NA
Treating this as a normal prediction problem, I was going to build the randomForest model with the following split:
- Train data: 2008Q1 - 2013Q4
- Test data: 2014Q1 - 2015Q4
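A minimal sketch of that chronological split, assuming Quarter is stored as a character string such as "2008Q1". The `toy` frame and its `y` column are made up for illustration and are not the question's data.

```r
# Toy frame with the same Quarter labels as the question; y is synthetic.
quarters <- paste0(rep(2008:2015, each = 4), "Q", 1:4)
set.seed(1)
toy <- data.frame(Quarter = quarters, y = rnorm(length(quarters)),
                  stringsAsFactors = FALSE)

# "YYYYQn" strings sort chronologically, so plain string comparison works.
train <- toy[toy$Quarter <= "2013Q4", ]
test  <- toy[toy$Quarter >= "2014Q1" & toy$Quarter <= "2015Q4", ]

nrow(train)  # 24
nrow(test)   # 8
```

Splitting by date rather than by random sampling keeps every test quarter later than every training quarter.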
Before doing that, I started reading about time series and lag variables, so I decided to add lagged independent variables.
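# dplyr::lag() shifts the values down by 4 rows; stats::lag() on a plain vector would not.
oil.data$US_GDP_L <- dplyr::lag(oil.data$US_GDP, 4)
oil.data$UK_L <- dplyr::lag(oil.data$UK, 4)
oil.data$Canada_L <- dplyr::lag(oil.data$Canada, 4)
oil.data$MiddleEast_L <- dplyr::lag(oil.data$MiddleEast, 4)
oil.data$Africa_L <- dplyr::lag(oil.data$Africa, 4)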
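To make the shifting explicit, here is a small base R sketch. `shift_by` is a hypothetical helper written for this illustration, not a function from any package. It also shows why `stats::lag()` is not a substitute when applied to a plain vector.

```r
# Hypothetical helper: shift a vector down by k positions, padding with NA.
shift_by <- function(x, k) c(rep(NA, k), head(x, -k))

x <- c(10, 20, 30, 40, 50, 60)
shift_by(x, 4)                 # NA NA NA NA 10 20

# stats::lag() on a plain vector leaves the values in place and only
# attaches a shifted time attribute, so the column would be unchanged.
as.numeric(stats::lag(x, 4))   # 10 20 30 40 50 60
```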
After that, my data.frame looks as follows:
> oil.data
Quarter US_GDP UK Canada MiddleEast Africa Total_Oil_Production US_GDP_L UK_L Canada_L MiddleEast_L Africa_L
1 2008Q1 14685.3 77.22900 96.73333 0.06666667 7784.333 1290.3 NA NA NA NA NA
2 2008Q2 14668.4 78.19967 98.36667 0.36666667 7988.200 1212.8 NA NA NA NA NA
3 2008Q3 14813.0 78.29500 98.46667 0.13333333 8090.567 1302.0 NA NA NA NA NA
4 2008Q4 14843.0 78.63800 97.56667 0.60000000 8120.800 1136.6 NA NA NA NA NA
5 2009Q1 14549.9 78.47733 98.23333 0.30000000 8197.200 846.4 14685.3 77.22900 96.73333 0.06666667 7784.333
6 2009Q2 14383.9 79.22400 99.70000 0.40000000 8278.100 748.3 14668.4 78.19967 98.36667 0.36666667 7988.200
7 2009Q3 14340.4 79.35367 100.76667 0.66666667 8405.167 882.0 14813.0 78.29500 98.46667 0.13333333 8090.567
8 2009Q4 14384.1 79.93233 101.26667 0.13333333 8595.100 1015.3 14843.0 78.63800 97.56667 0.60000000 8120.800
9 2010Q1 14566.5 79.69867 102.63333 1.03333333 8664.733 888.4 14549.9 78.47733 98.23333 0.30000000 8197.200
10 2010Q2 14681.1 80.22133 102.46667 -0.23333333 8794.467 863.2 14383.9 79.22400 99.70000 0.40000000 8278.100
11 2010Q3 14888.6 80.35433 102.86667 1.00000000 8943.500 1038.5 14340.4 79.35367 100.76667 0.66666667 8405.167
12 2010Q4 15057.7 80.76933 103.03333 0.73333333 9042.900 1017.1 14384.1 79.93233 101.26667 0.13333333 8595.100
13 2011Q1 15230.2 80.57133 103.56667 0.06666667 9228.233 1005.4 14566.5 79.69867 102.63333 1.03333333 8664.733
14 2011Q2 15238.4 81.10900 104.73333 1.13333333 9186.567 1037.4 14681.1 80.22133 102.46667 -0.23333333 8794.467
15 2011Q3 15460.9 81.18333 104.93333 -0.03333333 9320.833 1112.4 14888.6 80.35433 102.86667 1.00000000 8943.500
16 2011Q4 15587.1 81.64333 105.40000 0.36666667 9471.433 1084.8 15057.7 80.76933 103.03333 0.73333333 9042.900
17 2012Q1 15785.3 81.59233 105.76667 0.86666667 9624.400 1112.5 15230.2 80.57133 103.56667 0.06666667 9228.233
18 2012Q2 15973.9 82.31667 106.73333 0.13333333 9889.833 1179.8 15238.4 81.10900 104.73333 1.13333333 9186.567
19 2012Q3 16121.9 82.62600 107.66667 0.06666667 9981.067 1294.4 15460.9 81.18333 104.93333 -0.03333333 9320.833
20 2012Q4 16227.9 82.88400 107.73333 0.53333333 10081.167 1233.2 15587.1 81.64333 105.40000 0.36666667 9471.433
21 2013Q1 16297.3 82.68000 108.26667 0.33333333 10195.533 1222.1 15785.3 81.59233 105.76667 0.86666667 9624.400
22 2013Q2 16440.7 83.29167 109.46667 0.03333333 10358.000 1251.9 15973.9 82.31667 106.73333 0.13333333 9889.833
23 2013Q3 16526.8 83.53867 109.53333 0.46666667 10460.500 1403.4 16121.9 82.62600 107.66667 0.06666667 9981.067
24 2013Q4 16727.5 84.22500 109.20000 0.20000000 10547.333 1342.7 16227.9 82.88400 107.73333 0.53333333 10081.167
25 2014Q1 16957.6 84.14633 110.23333 -0.30000000 10662.133 1296.5 16297.3 82.68000 108.26667 0.33333333 10195.533
26 2014Q2 16984.3 84.86833 111.86667 0.40000000 10831.233 1270.6 16440.7 83.29167 109.46667 0.03333333 10358.000
27 2014Q3 17270.0 84.91467 111.86667 -0.93333333 11175.433 1500.0 16526.8 83.53867 109.53333 0.46666667 10460.500
28 2014Q4 17522.1 85.44067 111.83333 -3.20000000 11029.733 1451.2 16727.5 84.22500 109.20000 0.20000000 10547.333
29 2015Q1 17615.9 85.19467 112.20000 -0.20000000 10941.333 1392.3 16957.6 84.14633 110.23333 -0.30000000 10662.133
30 2015Q2 17649.3 86.17133 114.50000 0.93333333 10858.967 1346.3 16984.3 84.86833 111.86667 0.40000000 10831.233
31 2015Q3 17913.7 86.65300 115.70000 -0.10000000 10985.200 1554.4 17270.0 84.91467 111.86667 -0.93333333 11175.433
32 2015Q4 18060.2 86.85767 116.90000 0.80000000 10933.033 1542.6 17522.1 85.44067 111.83333 -3.20000000 11029.733
33 2016Q1 NA NA NA NA NA NA 17615.9 85.19467 112.20000 -0.20000000 10941.333
34 2016Q2 NA NA NA NA NA NA 17649.3 86.17133 114.50000 0.93333333 10858.967
35 2016Q3 NA NA NA NA NA NA 17913.7 86.65300 115.70000 -0.10000000 10985.200
36 2016Q4 NA NA NA NA NA NA 18060.2 86.85767 116.90000 0.80000000 10933.033
NA has no meaning in training, but if I do na.omit(oil.data) I will lose the quarters 2008Q1 - 2008Q4.
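One possible way to handle the NAs without a blanket `na.omit()` is to select rows per purpose. The sketch below uses a made-up `toy` frame (8 observed rows plus 4 future rows) with hypothetical column names, and assumes the model would use only the lagged predictor.

```r
# Toy frame mimicking the layout: 8 observed rows plus 4 future rows.
toy <- data.frame(
  x   = c(1:8, rep(NA, 4)),     # current predictor, unknown in the future
  y   = c(11:18, rep(NA, 4)),   # response, unknown in the future
  x_L = c(rep(NA, 4), 1:8)      # 4-step lag of x
)

# Rows usable for fitting y ~ x_L: response and lagged predictor both present.
fit_rows <- complete.cases(toy[, c("y", "x_L")])
# Rows to forecast: lag available, response not yet observed.
future_rows <- is.na(toy$y) & !is.na(toy$x_L)

which(fit_rows)     # 5 6 7 8
which(future_rows)  # 9 10 11 12
```

Note that `na.omit()` on the whole frame would drop the future rows as well, because their current-quarter columns are NA.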
How do I actually train a random forest (or SVM) model with the lag variables and use the predict function to predict future quarters?
Do I actually remove the NAs and do the following:
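oi.data.without.na <- na.omit(oil.data)  # drop every row that contains an NA
n <- nrow(oi.data.without.na)
in.test <- seq(n - (n %/% 10 * 2), n)  # integer division; the last rows (roughly 20%)
test <- oi.data.without.na[in.test, ]
train <- oi.data.without.na[-in.test, ]
rm(list = c('n', 'in.test'))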
and train only on the data from 2009Q1 to 2015Q4?
Once the model has been built, which data points should we actually pass to the predict function to forecast? Do we use the lagged variables?
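A rough sketch of what fitting on a lagged predictor and forecasting the future rows could look like. All data here is synthetic and the column names `x_L` and `y` are hypothetical. It assumes the randomForest package may or may not be installed, and falls back to `lm()` purely as a stand-in that shares the formula and `predict(..., newdata = )` interface.

```r
set.seed(42)
x   <- cumsum(rnorm(32, mean = 1))   # 32 observed quarters (synthetic)
x_L <- c(rep(NA, 4), x)              # lag 4: 36 rows, 4 beyond the data

toy <- data.frame(x_L = x_L, y = 50 + 2 * x_L + rnorm(36))
toy$y[33:36] <- NA                   # future quarters: response unknown

train_rows  <- complete.cases(toy)             # rows 5..32
future_rows <- is.na(toy$y) & !is.na(toy$x_L)  # rows 33..36

fit <- if (requireNamespace("randomForest", quietly = TRUE)) {
  randomForest::randomForest(y ~ x_L, data = toy[train_rows, ])
} else {
  lm(y ~ x_L, data = toy[train_rows, ])  # stand-in when randomForest is absent
}

# Only the lagged predictor is known for the future rows, so only it is passed.
forecast <- predict(fit, newdata = toy[future_rows, "x_L", drop = FALSE])
length(forecast)  # 4
```

With a 4-quarter lag, the lagged columns are known for exactly 4 quarters past the last observation, which is what makes those rows usable as `newdata`.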
I was trying out the timeslice resampling method of trainControl in caret, but the initialWindow and horizon arguments are not well documented.
randForest.timeSlice <- train(Total_Oil_Production ~ . - Quarter, data = train, method = 'rf',
                              trControl = trainControl(method = 'timeslice',
                                                       initialWindow = 32,
                                                       horizon = 5,
                                                       fixedWindow = TRUE),
                              prox = TRUE, allowParallel = TRUE, importance = TRUE)
This gives an error (initialWindow and horizon are not well explained in the documentation):
Error in seq.default(dots[[1L]][[1L]], dots[[2L]][[1L]]) :
'from' cannot be NA, NaN or infinite
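For reference, here is a base R sketch of how rolling-origin slices are usually built from these two arguments. `time_slices` is an illustrative re-implementation written for this post, not caret's code. The error above looks consistent with `initialWindow = 32` plus `horizon = 5` exceeding the number of rows in `train`, though that is an assumption rather than something verified against caret's source.

```r
# Illustrative re-implementation of rolling-origin slices (not caret's code).
# initialWindow: number of consecutive rows in each training slice.
# horizon:       number of rows immediately after it used for testing.
time_slices <- function(n, initialWindow, horizon, fixedWindow = TRUE) {
  if (initialWindow + horizon > n)
    stop("initialWindow + horizon must not exceed the number of rows")
  stops  <- initialWindow:(n - horizon)
  starts <- if (fixedWindow) stops - initialWindow + 1 else rep(1, length(stops))
  list(train = Map(seq, starts, stops),
       test  = Map(seq, stops + 1, stops + horizon))
}

s <- time_slices(n = 10, initialWindow = 6, horizon = 2)
length(s$train)   # 3 slices
s$train[[1]]      # 1 2 3 4 5 6
s$test[[1]]       # 7 8
s$train[[3]]      # 3 4 5 6 7 8
s$test[[3]]       # 9 10
```

With `fixedWindow = FALSE` in this sketch, every training slice starts at row 1 and grows instead of sliding.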