Random forest has a predict function: you supply values for the independent variables the model was trained on, and it predicts a value for the dependent variable. My goal is to figure out how to train a random forest, and predict with it, using lagged variables.

I have a data set with the following independent variables:

Quarter, US_GDP, UK, Canada, MiddleEast, Africa

My dependent variable is Total_Oil_Production.

location: data file

I have data from 2008Q1 to 2015Q4, and my goal is to predict oil production for future quarters, from 2016 onwards.

> head(oil.data)
  Quarter  US_GDP       UK   Canada MiddleEast   Africa Total_Oil_Production
1  2008Q1 14685.3 77.22900 96.73333 0.06666667 7784.333               1290.3
2  2008Q2 14668.4 78.19967 98.36667 0.36666667 7988.200               1212.8
3  2008Q3 14813.0 78.29500 98.46667 0.13333333 8090.567               1302.0
4  2008Q4 14843.0 78.63800 97.56667 0.60000000 8120.800               1136.6
5  2009Q1 14549.9 78.47733 98.23333 0.30000000 8197.200                846.4
6  2009Q2 14383.9 79.22400 99.70000 0.40000000 8278.100                748.3

As you can see, I have no data for the quarters from 2016 onwards.

> tail(oil.data)
   Quarter  US_GDP       UK Canada MiddleEast   Africa Total_Oil_Production
31  2015Q3 17913.7 86.65300  115.7       -0.1 10985.20               1554.4
32  2015Q4 18060.2 86.85767  116.9        0.8 10933.03               1542.6
33  2016Q1      NA       NA     NA         NA       NA                   NA
34  2016Q2      NA       NA     NA         NA       NA                   NA
35  2016Q3      NA       NA     NA         NA       NA                   NA
36  2016Q4      NA       NA     NA         NA       NA                   NA

Treating this as a normal prediction problem, I was going to take the following steps to build the randomForest model:

  1. Use 2008Q1 - 2013Q4 as training data
  2. Use 2014Q1 - 2015Q4 as test data

Before doing that, I started reading about time series and lag variables, so I decided to add lagged independent variables.

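# The output below matches dplyr::lag(), which shifts values down by n rows;
# stats::lag() on a plain vector would leave the values in place.
library(dplyr)

oil.data$US_GDP_L <- dplyr::lag(oil.data$US_GDP, n = 4)
oil.data$UK_L <- dplyr::lag(oil.data$UK, n = 4)
oil.data$Canada_L <- dplyr::lag(oil.data$Canada, n = 4)
oil.data$MiddleEast_L <- dplyr::lag(oil.data$MiddleEast, n = 4)
oil.data$Africa_L <- dplyr::lag(oil.data$Africa, n = 4)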
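
A lag of 4 quarters means that row i holds the value observed in row i - 4. The sketch below illustrates this in base R only, using the first six US_GDP values from the table. `shift_by` is a hypothetical helper written for this example and does not come from any package. The second half shows why the choice of `lag` function matters: `stats::lag()` moves the time index of a series and leaves the values of a plain vector where they are.

```r
# Hypothetical helper (not from any package): shift x down by k positions,
# padding the first k entries with NA.
shift_by <- function(x, k) {
  stopifnot(k >= 0, k < length(x))
  if (k == 0) return(x)
  c(rep(NA, k), head(x, -k))
}

# First six US_GDP values from the question
gdp <- c(14685.3, 14668.4, 14813.0, 14843.0, 14549.9, 14383.9)

shifted <- shift_by(gdp, 4)
print(shifted)    # NA NA NA NA 14685.3 14668.4

# stats::lag() only changes the time-series attributes; the values do not move.
unshifted <- as.numeric(stats::lag(gdp, 4))
print(identical(unshifted, gdp))    # TRUE
```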

After adding the lagged columns, my data.frame looks as follows:

> oil.data
   Quarter  US_GDP       UK    Canada  MiddleEast    Africa Total_Oil_Production US_GDP_L     UK_L  Canada_L MiddleEast_L  Africa_L
1   2008Q1 14685.3 77.22900  96.73333  0.06666667  7784.333               1290.3       NA       NA        NA           NA        NA
2   2008Q2 14668.4 78.19967  98.36667  0.36666667  7988.200               1212.8       NA       NA        NA           NA        NA
3   2008Q3 14813.0 78.29500  98.46667  0.13333333  8090.567               1302.0       NA       NA        NA           NA        NA
4   2008Q4 14843.0 78.63800  97.56667  0.60000000  8120.800               1136.6       NA       NA        NA           NA        NA
5   2009Q1 14549.9 78.47733  98.23333  0.30000000  8197.200                846.4  14685.3 77.22900  96.73333   0.06666667  7784.333
6   2009Q2 14383.9 79.22400  99.70000  0.40000000  8278.100                748.3  14668.4 78.19967  98.36667   0.36666667  7988.200
7   2009Q3 14340.4 79.35367 100.76667  0.66666667  8405.167                882.0  14813.0 78.29500  98.46667   0.13333333  8090.567
8   2009Q4 14384.1 79.93233 101.26667  0.13333333  8595.100               1015.3  14843.0 78.63800  97.56667   0.60000000  8120.800
9   2010Q1 14566.5 79.69867 102.63333  1.03333333  8664.733                888.4  14549.9 78.47733  98.23333   0.30000000  8197.200
10  2010Q2 14681.1 80.22133 102.46667 -0.23333333  8794.467                863.2  14383.9 79.22400  99.70000   0.40000000  8278.100
11  2010Q3 14888.6 80.35433 102.86667  1.00000000  8943.500               1038.5  14340.4 79.35367 100.76667   0.66666667  8405.167
12  2010Q4 15057.7 80.76933 103.03333  0.73333333  9042.900               1017.1  14384.1 79.93233 101.26667   0.13333333  8595.100
13  2011Q1 15230.2 80.57133 103.56667  0.06666667  9228.233               1005.4  14566.5 79.69867 102.63333   1.03333333  8664.733
14  2011Q2 15238.4 81.10900 104.73333  1.13333333  9186.567               1037.4  14681.1 80.22133 102.46667  -0.23333333  8794.467
15  2011Q3 15460.9 81.18333 104.93333 -0.03333333  9320.833               1112.4  14888.6 80.35433 102.86667   1.00000000  8943.500
16  2011Q4 15587.1 81.64333 105.40000  0.36666667  9471.433               1084.8  15057.7 80.76933 103.03333   0.73333333  9042.900
17  2012Q1 15785.3 81.59233 105.76667  0.86666667  9624.400               1112.5  15230.2 80.57133 103.56667   0.06666667  9228.233
18  2012Q2 15973.9 82.31667 106.73333  0.13333333  9889.833               1179.8  15238.4 81.10900 104.73333   1.13333333  9186.567
19  2012Q3 16121.9 82.62600 107.66667  0.06666667  9981.067               1294.4  15460.9 81.18333 104.93333  -0.03333333  9320.833
20  2012Q4 16227.9 82.88400 107.73333  0.53333333 10081.167               1233.2  15587.1 81.64333 105.40000   0.36666667  9471.433
21  2013Q1 16297.3 82.68000 108.26667  0.33333333 10195.533               1222.1  15785.3 81.59233 105.76667   0.86666667  9624.400
22  2013Q2 16440.7 83.29167 109.46667  0.03333333 10358.000               1251.9  15973.9 82.31667 106.73333   0.13333333  9889.833
23  2013Q3 16526.8 83.53867 109.53333  0.46666667 10460.500               1403.4  16121.9 82.62600 107.66667   0.06666667  9981.067
24  2013Q4 16727.5 84.22500 109.20000  0.20000000 10547.333               1342.7  16227.9 82.88400 107.73333   0.53333333 10081.167
25  2014Q1 16957.6 84.14633 110.23333 -0.30000000 10662.133               1296.5  16297.3 82.68000 108.26667   0.33333333 10195.533
26  2014Q2 16984.3 84.86833 111.86667  0.40000000 10831.233               1270.6  16440.7 83.29167 109.46667   0.03333333 10358.000
27  2014Q3 17270.0 84.91467 111.86667 -0.93333333 11175.433               1500.0  16526.8 83.53867 109.53333   0.46666667 10460.500
28  2014Q4 17522.1 85.44067 111.83333 -3.20000000 11029.733               1451.2  16727.5 84.22500 109.20000   0.20000000 10547.333
29  2015Q1 17615.9 85.19467 112.20000 -0.20000000 10941.333               1392.3  16957.6 84.14633 110.23333  -0.30000000 10662.133
30  2015Q2 17649.3 86.17133 114.50000  0.93333333 10858.967               1346.3  16984.3 84.86833 111.86667   0.40000000 10831.233
31  2015Q3 17913.7 86.65300 115.70000 -0.10000000 10985.200               1554.4  17270.0 84.91467 111.86667  -0.93333333 11175.433
32  2015Q4 18060.2 86.85767 116.90000  0.80000000 10933.033               1542.6  17522.1 85.44067 111.83333  -3.20000000 11029.733
33  2016Q1      NA       NA        NA          NA        NA                   NA  17615.9 85.19467 112.20000  -0.20000000 10941.333
34  2016Q2      NA       NA        NA          NA        NA                   NA  17649.3 86.17133 114.50000   0.93333333 10858.967
35  2016Q3      NA       NA        NA          NA        NA                   NA  17913.7 86.65300 115.70000  -0.10000000 10985.200
36  2016Q4      NA       NA        NA          NA        NA                   NA  18060.2 86.85767 116.90000   0.80000000 10933.033

The NA values carry no meaning for training, but if I call na.omit(oil.data) I will lose the quarters 2008Q1 - 2008Q4 (and, since na.omit drops every row that contains an NA, the 2016 rows as well).

How do I actually train a random forest (or SVM) model on the lag variables and then use the predict function to forecast future quarters?

Do I actually remove the NAs and do the following:

oi.data.without.na <- na.omit(oil.data)  # drops every row that contains an NA
n <- nrow(oi.data.without.na)
# Hold out the last rows, in time order, as the test set
in.test <- seq(n - (n %/% 10 * 2), n)  # integer division
test <- oi.data.without.na[in.test, ]
train <- oi.data.without.na[-in.test, ]
rm(list = c('n', 'in.test'))

and train only on the 2009Q1 to 2015Q4 data?
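
One possible alternative to a blanket na.omit is to separate the rows by what is known in each of them: rows where both the target and the lagged predictor exist can be used for fitting and testing, while rows where only the lagged predictor exists are the ones to forecast. The sketch below uses a made-up 12-row stand-in for oil.data. The column names `x`, `y`, `x_L` and all values are assumptions for illustration, not the real data.

```r
# Made-up stand-in for oil.data: y is the target, x a predictor,
# x_L its 4-quarter lag. The last 4 quarters have no observations yet.
toy <- data.frame(
  Quarter = paste0(rep(2008:2010, each = 4), "Q", 1:4),
  x       = c(1:8, NA, NA, NA, NA),
  y       = c(seq(10, 80, by = 10), NA, NA, NA, NA)
)
toy$x_L <- c(rep(NA, 4), head(toy$x, -4))

# Rows usable for fitting: target and lagged predictor are both known.
fit_rows <- !is.na(toy$y) & !is.na(toy$x_L)
# Rows to forecast: target unknown, lagged predictor known.
future_rows <- is.na(toy$y) & !is.na(toy$x_L)

model_data <- toy[fit_rows, ]
future     <- toy[future_rows, ]

# Chronological split: the last 2 fitting rows become the test set.
n_test <- 2
test  <- tail(model_data, n_test)
train <- head(model_data, -n_test)

print(as.character(future$Quarter))    # "2010Q1" "2010Q2" "2010Q3" "2010Q4"
```

Splitting with head and tail keeps the rows in time order, so the test set always lies after the training set.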

Once the model has been built, which data points should we actually pass to the predict function to forecast? Do we use the lagged variables?
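
One way this could work, shown as a minimal sketch rather than a recommended model: fit the forest on lagged predictors only, so that every input the model needs already exists for the future rows, and pass those lagged columns as `newdata`. The data here is randomly generated, and the names `x`, `y`, `x_L` are assumptions for illustration. The calls `randomForest(formula, data, ntree)` and `predict(object, newdata)` come from the randomForest package.

```r
library(randomForest)

set.seed(42)  # random forests are stochastic; fix the seed for repeatability

# Made-up data: 36 quarters, of which the last 4 have no observed values yet.
n_obs <- 32
n_all <- 36
x <- c(cumsum(rnorm(n_obs)) + 100, rep(NA, n_all - n_obs))
y <- c(1000 + 5 * x[1:n_obs] + rnorm(n_obs), rep(NA, n_all - n_obs))
toy <- data.frame(x = x, y = y)
toy$x_L <- c(rep(NA, 4), head(toy$x, -4))   # 4-quarter lag of x

train  <- toy[5:n_obs, ]              # lag and target both known
future <- toy[(n_obs + 1):n_all, ]    # target unknown, lag known

# Fit on the lagged predictor only, so the model never needs an
# unlagged value that does not exist yet.
rf <- randomForest(y ~ x_L, data = train, ntree = 200)

# Forecast the 4 future quarters from their lagged inputs.
forecast <- predict(rf, newdata = future[, "x_L", drop = FALSE])
print(forecast)
```

With a lag of 4 quarters, this setup can only reach 4 quarters past the last observation, because later rows would need lagged inputs that have not been observed.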

I was trying out the timeslice resampling method in caret's trainControl, but the initialWindow and horizon arguments are not well documented.

randForest.timeSlice <- train(Total_Oil_Production ~. - Quarter, data = train, method = 'rf',
                     trControl = trainControl(method = 'timeslice',
                                              initialWindow = 32,
                                              horizon = 5,
                                              fixedWindow = TRUE),
                     prox = TRUE, allowParallel = TRUE, importance = TRUE)

This gives an error (initialWindow and horizon are not well explained in the documentation):

Error in seq.default(dots[[1L]][[1L]], dots[[2L]][[1L]]) : 
  'from' cannot be NA, NaN or infinite
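
As far as the caret documentation describes them, initialWindow is the number of consecutive rows in each training slice and horizon is the number of rows in the test slice that follows it. If train was built with the split shown earlier (28 complete rows minus 5 test rows), it has only 23 rows, so a window of 32 plus a horizon of 5 cannot fit, which would be consistent with the error above. The base-R sketch below shows the arithmetic. `rolling_slices` is a hypothetical helper that mimics this behaviour for illustration; it is not caret's implementation.

```r
# Hypothetical helper: each slice trains on `initial_window` consecutive rows
# and tests on the next `horizon` rows, then the origin moves forward by one.
rolling_slices <- function(n, initial_window, horizon, fixed_window = TRUE) {
  last_stop <- n - horizon
  if (initial_window > last_stop) {
    stop("initial_window + horizon must not exceed the number of rows")
  }
  stops <- seq(initial_window, last_stop)
  lapply(stops, function(s) {
    # A fixed window slides along; otherwise training always starts at row 1.
    start <- if (fixed_window) s - initial_window + 1 else 1
    list(train = seq(start, s), test = seq(s + 1, s + horizon))
  })
}

slices <- rolling_slices(n = 23, initial_window = 12, horizon = 4)
print(length(slices))       # 8 resamples
print(slices[[1]]$train)    # rows 1..12
print(slices[[1]]$test)     # rows 13..16

# The settings from the train() call above cannot fit into 23 rows:
res <- try(rolling_slices(n = 23, initial_window = 32, horizon = 5), silent = TRUE)
print(inherits(res, "try-error"))    # TRUE
```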
add-semi-colons
  • I may be wrong, but I don't think RF can include a time variable in its formulation. You could omit it and then try to predict the variable of interest. – Chirayu Chamoli Jul 05 '16 at 04:49
  • You may want to have a look at http://stackoverflow.com/questions/24758218/time-series-data-spliting-and-model-evaluation, which proposes creating the slices manually first. I don't see any issue with using lagged data, and you will indeed have to remove the first 4 records first (i.e. begin at 2009Q1). – Eric Lecoutre Jul 05 '16 at 08:55
  • Note that you may be interested in traditional forecasting models, especially as Total_Oil_Production is the total of the rest. Possibly use hierarchical forecasting, for instance with the `hts` package. Not only will you have forecasts for every geographical area and for the total, but the sums will also be coherent. – Eric Lecoutre Jul 05 '16 at 08:57
  • @EricLecoutre Thanks. So let's say I build the random forest with `rf <- randomForest(Total_Oil_Production ~ ., data=train, importance=TRUE, na.action = na.roughfix)`. There is a good chance this model uses both lagged and non-lagged features. Since this is out-of-sample prediction, I will not have values for the non-lagged features. How should this be handled? – add-semi-colons Jul 06 '16 at 20:44

0 Answers