Splitting data for time series prediction

Question

I am looking for an R package that allows me to do n-fold CV-style hyperparameter optimisation (e.g. n = 10). Let us say this is the data I can use to tune the hyperparameters (I tend to use rBayesianOptimization, so let us abstract this away):

dates <- seq(as.Date('2017-01-01'), as.Date('2019-12-31'), by = 'days')

df <- data.frame(date = dates)
df$y <- 42

Here y, the dependent variable, is an obviously well known constant and it is just added here without being exploited.

I came across the caret function createTimeSlices and this would be a possible approach to split the data:

slices <- createTimeSlices(df$date, initialWindow = 365 * 2.5, horizon = 30, fixedWindow = TRUE)

I end up with a list like this:

List of 2
 $ train:List of 153
  ..$ Training0912.5: int [1:912] 1 2 3 4 5 6 7 8 9 10 ...
  ..$ Training0913.5: int [1:912] 2 3 4 5 6 7 8 9 10 11 ...
...   
  ..$ Training1010.5: int [1:912] 99 100 101 102 103 104 105 106 107 108 ...
  .. [list output truncated]
 $ test :List of 153
  ..$ Testing0912.5: num [1:30] 914 914 916 916 918 ...
  ..$ Testing0913.5: num [1:30] 914 916 916 918 918 ...

Can someone please point out how to use this or refer me to another package? Personally, I am a bit confused that the training indices only shift by 1 day (?). I would have thought they would shift by 30 days (see horizon).

Thanks.

What is your specific goal? You mention 10-fold CV, parameter tuning, and data segmentation, and also a question about how to use the `createTimeSlices` function. If you can provide exact expected output, and clarify what your question is, that'd be helpful. — andrew_reece, Dec 30 '20 at 17:23
It is a bit hard to provide an example in this case imho. In 10-fold CV you use 9 folds for training and 1 for validation. You get 10 performance measures, e.g. the AUC. You can then calculate the average for the hyperparameter setting. I want to do something similar for time series. Obviously, time series are sequential but the principle is the same. — cs0815, Dec 30 '20 at 17:28
You might want to check out packages fable & fabletools and modeltime. They deal with timeseries and forecasting. — phiver, Dec 30 '20 at 18:28
@phiver these packages look interesting. If you could point me to the doc/code that shows how to split a ts, that would be very kind. As so often, the doc looks sketchy ... — cs0815, Dec 31 '20 at 11:19
@cs0815, the [must read book](https://otexts.com/fpp3/) for fable & fabletools and forecasting in general. For modeltime: https://www.business-science.io/code-tools/2020/06/29/introducing-modeltime.html modeltime.ensemble: https://www.business-science.io/code-tools/2020/10/13/introducing-modeltime-ensemble.html modeltime.resample: https://cran.r-project.org/web/packages/modeltime.resample/vignettes/getting-started.html — phiver, Dec 31 '20 at 11:51
I was aware of the must read book thanks. In the meantime I found a way to use createTimeSlices — cs0815, Dec 31 '20 at 11:56

score 0 · Accepted Answer · answered Dec 31 '20 at 12:01

I found a way to use createTimeSlices inspired by Shambho's SO answer.

library(caret)

dates <- seq(as.Date('2017-01-01'), as.Date('2019-12-31'), by = 'days')

df <- data.frame(date = dates)
df$x <- 1
df$y <- 42

timeSlices <- createTimeSlices(1:nrow(df), initialWindow = 365 * 2, horizon = 30, fixedWindow = TRUE, skip = 30)

#str(timeSlices, max.level = 1)

trainSlices <- timeSlices[[1]]
testSlices <- timeSlices[[2]]

for (i in seq_along(trainSlices)) {

    train <- df[trainSlices[[i]],]
    test <- df[testSlices[[i]],]

    # fit and calculate performance on test to ultimately get average etc.

    print(paste0(min(train$date), " - ", max(train$date)))
    print(paste0(min(test$date), " - ", max(test$date)))
    print("")
}
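
The placeholder comment in the loop can be filled in along these lines. This is only a sketch: the naive mean forecast and the RMSE metric are stand-ins chosen for illustration, and `fold_rmse` is a hypothetical name. Swap in the real model and whatever metric the hyperparameter search should optimise.

```r
library(caret)

dates <- seq(as.Date("2017-01-01"), as.Date("2019-12-31"), by = "days")
df <- data.frame(date = dates, x = 1, y = 42)

timeSlices <- createTimeSlices(seq_len(nrow(df)), initialWindow = 365 * 2,
                               horizon = 30, fixedWindow = TRUE, skip = 30)

# One score per slice; fold_rmse is a hypothetical name for this sketch.
fold_rmse <- vapply(seq_along(timeSlices$train), function(i) {
    train <- df[timeSlices$train[[i]], ]
    test  <- df[timeSlices$test[[i]], ]

    # Stand-in "model" (an assumption): predict the training mean for every test day.
    pred <- rep(mean(train$y), nrow(test))

    sqrt(mean((test$y - pred)^2))
}, numeric(1))

# Average over slices: the value to report for one hyperparameter setting.
# It is 0 here because y is constant.
mean(fold_rmse)
```

Inside a tuning loop, this average would be the objective value returned for each candidate hyperparameter setting.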

The key for me was to specify skip, as otherwise the window would only move 1 day and one would end up with too many "folds".
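
To see how the step size drives the number of folds, the index arithmetic can be reproduced in base R. This is a sketch, not caret's code: `make_slices` and its `step` argument are hypothetical names. As far as I can tell from the caret source, `createTimeSlices` advances the window by `skip + 1` observations, which would match the 153 one-day-apart slices in the question. On that assumption `skip = 30` moves the window 31 days and `skip = 29` would move it exactly one horizon, but this is worth checking against the installed caret version.

```r
# Hypothetical helper (not part of caret): fixed-width rolling-origin slices.
# `step` is the number of observations the window advances between slices.
make_slices <- function(n, initial_window, horizon, step) {
    stops  <- seq(initial_window, n - horizon, by = step)
    starts <- stops - initial_window + 1
    list(train = Map(seq, starts, stops),
         test  = Map(seq, stops + 1, stops + horizon))
}

n <- 1095  # days from 2017-01-01 to 2019-12-31

# Number of slices for a one-day, a 31-day and a 30-day step:
# 336, 11 and 12 slices respectively.
sapply(c(1, 31, 30), function(step) {
    length(make_slices(n, 365 * 2, 30, step)$train)
})
```

With a 30-day step the helper's test windows tile the remaining data back to back. With a 31-day step one day between consecutive test windows never lands in any test set.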
