
I have a data set like the following

library(caret)
library(data.table)

set.seed(503)
foo <- data.table(group = rep(LETTERS[1:6], 150),
                  y  = rnorm(n = 6 * 150, mean = 5, sd = 2),
                  x1 = rnorm(n = 6 * 150, mean = 5, sd = 10),
                  x2 = rnorm(n = 6 * 150, mean = 25, sd = 10),
                  x3 = rnorm(n = 6 * 150, mean = 50, sd = 10),
                  x4 = rnorm(n = 6 * 150, mean = 0.5, sd = 10),
                  x5 = sample(c(1, 0), size = 6 * 150, replace = T))

foo[, period := 1:.N, by = group]

Problem: I want to forecast y one step ahead, for each group, using variables x1, ..., x5.

I want to run a few models in caret to decide which I will use.

As of now, I am running it in a loop using timeslice

window.length <- 115
timecontrol <- trainControl(method            = 'timeslice',
                            initialWindow     = window.length,
                            horizon           = 1,
                            selectionFunction = "best",
                            fixedWindow       = TRUE,
                            savePredictions   = 'final')

model_list <- list()
for(g in unique(foo$group)){
  for(model in c("xgbTree", "earth", "cubist")){
    dat <- foo[group == g][, c('group', 'period') := NULL]
    model_list[[g]][[model]] <- train(y ~ . - 1,
                                      data = dat,
                                      method = model, 
                                      trControl = timecontrol)

  }
}

However, I would like to run all groups at the same time, using dummy variables to identify each one, like this:

dat <- cbind(foo, model.matrix(~ group - 1, foo))
            y         x1       x2       x3            x4 x5 period groupA groupB groupC groupD groupE groupF
  1: 5.710250 11.9615460 22.62916 31.04790 -4.821331e-04  1      1      1      0      0      0      0      0
  2: 3.442213  8.6558983 32.41881 45.70801  3.255423e-01  1      1      0      1      0      0      0      0
  3: 3.485286  7.7295448 21.99022 56.42133  8.668391e+00  1      1      0      0      1      0      0      0
  4: 9.659601  0.9166456 30.34609 55.72661 -7.666063e+00  1      1      0      0      0      1      0      0
  5: 5.567950  3.0306864 22.07813 52.21099  5.377153e-01  1      1      0      0      0      0      1      0

But still running the time series with the correct time ordering using timeslice.

Is there a way to declare the time variable in trainControl, so that my one-step-ahead forecast uses, in this case, six more observations in each round and drops the first six observations?

I can do it by ordering the data and messing with the horizon argument (given n groups, order by the time variable and set horizon = n), but this has to change if the number of groups changes. And initialWindow would have to be window.length * n_groups:

timecontrol <- trainControl(method            = 'timeslice',
                            initialWindow     = window.length * length(unique(foo$group)),
                            horizon           = length(unique(foo$group)),
                            selectionFunction = "best",
                            fixedWindow       = TRUE,
                            savePredictions   = 'final')

Is there any other way?

Juan Carlos Ramirez
Felipe Alvarenga

2 Answers


I think the answer you are looking for is actually quite simple. You can use the skip argument to trainControl() to skip the desired number of slices after each train/test split. That way each group-period is predicted only once, the same period is never split between the training set and the testing set, and there is no information leakage.

One subtlety: timeslice advances the window by skip + 1 rows between resamples. So with the data stacked in period order (six rows per period), set horizon = 6 (the number of groups), skip = 5 (the number of groups minus one), and initialWindow = 115 * 6. Then the first test set contains all groups for period 116, the next test set all groups for period 117, and so on.

library(caret)
library(tidyverse)

set.seed(503)
foo <- tibble(group = rep(LETTERS[1:6], 150),
              y  = rnorm(n = 6 * 150, mean = 5, sd = 2),
              x1 = rnorm(n = 6 * 150, mean = 5, sd = 10),
              x2 = rnorm(n = 6 * 150, mean = 25, sd = 10),
              x3 = rnorm(n = 6 * 150, mean = 50, sd = 10),
              x4 = rnorm(n = 6 * 150, mean = 0.5, sd = 10),
              x5 = sample(c(1, 0), size = 6 * 150, replace = TRUE)) %>%
  group_by(group) %>%
  mutate(period = row_number()) %>%
  ungroup()

dat <- cbind(foo, model.matrix(~ group - 1, foo)) %>%
  select(-group)

window.length <- 115

timecontrol <- trainControl(
  method            = 'timeslice',
  initialWindow     = window.length * length(unique(foo$group)),
  horizon           = length(unique(foo$group)),
  skip              = length(unique(foo$group)) - 1,  # window moves skip + 1 rows per resample
  selectionFunction = "best",
  fixedWindow       = TRUE,
  savePredictions   = 'final'
)

model_names <- c("xgbTree", "earth", "cubist")
fits <- map(model_names,
            ~ train(
              y ~ . - 1,
              data = dat,
              method = .x,
              trControl = timecontrol
            )) %>% 
  set_names(model_names)
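To double-check the alignment, the index arithmetic behind createTimeSlices() can be reproduced in a few lines of base R. This is my own sketch of its behavior, not caret's code: training windows stop at rows initialWindow, initialWindow + 1, ..., and thinning keeps every (skip + 1)-th slice, so skip = n_groups - 1 advances exactly one period per resample.

```r
# Base-R sketch of fixed-window timeslice index generation.
# Assumption: thinning keeps every (skip + 1)-th slice, so with six
# rows per period, skip = 5 advances the window one full period at a time.
time_slices <- function(n, initialWindow, horizon, skip = 0) {
  stops <- initialWindow:(n - horizon)           # last training row of each slice
  keep  <- seq(1, length(stops), by = skip + 1)  # thin the resamples
  lapply(stops[keep], function(s)
    list(train = (s - initialWindow + 1):s,
         test  = (s + 1):(s + horizon)))
}

n_groups <- 6
slices <- time_slices(n             = 6 * 150,
                      initialWindow = 115 * n_groups,
                      horizon       = n_groups,
                      skip          = n_groups - 1)

slices[[1]]$test  # rows 691-696: period 116, all six groups
slices[[2]]$test  # rows 697-702: period 117, all six groups
```

If skip equaled the number of groups instead, the window would move seven rows at a time and the test sets would drift across period boundaries.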
Giovanni Colitti

I would use tidyr::nest() to nest groups and then iterate over the data with purrr::map(). This approach is much more flexible because it can accommodate different group sizes, different numbers of groups, and variable models or other arguments passed to caret::train(). Also, you can easily run everything in parallel using furrr.

Load packages and create data

I use tibble instead of data.table. I also reduce the size of the data.

library(caret)
library(tidyverse)

set.seed(503)

foo <- tibble(
  group = rep(LETTERS[1:6], 10),
  y  = rnorm(n = 6 * 10, mean = 5, sd = 2),
  x1 = rnorm(n = 6 * 10, mean = 5, sd = 10),
  x2 = rnorm(n = 6 * 10, mean = 25, sd = 10),
  x3 = rnorm(n = 6 * 10, mean = 50, sd = 10),
  x4 = rnorm(n = 6 * 10, mean = 0.5, sd = 10),
  x5 = sample(c(1, 0), size = 6 * 10, replace = T)
) %>%
  group_by(group) %>%
  mutate(period = row_number()) %>%
  ungroup()

Reduce initialWindow size

window.length <- 9
timecontrol   <- trainControl(
  method            = 'timeslice',
  initialWindow     = window.length,
  horizon           = 1,
  selectionFunction = "best",
  fixedWindow       = TRUE,
  savePredictions   = 'final'
)
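As a sanity check on these numbers: with only 10 periods per group, an initialWindow of 9 and a horizon of 1 leave exactly one train/test split per group (train on periods 1-9, predict period 10). The slice-counting arithmetic, sketched in base R (variable names are mine):

```r
# Number of timeslice resamples for one group (fixed window):
# training windows stop at rows initialWindow, ..., n_periods - horizon.
n_periods     <- 10
window.length <- 9
horizon       <- 1
stops <- window.length:(n_periods - horizon)
length(stops)  # 1 slice: train on periods 1-9, test on period 10
```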

Create a function that will return a list of fit model objects

# To fit each model in model_list to data and return model fits as a list.
fit_models <- function(data, model_list, timecontrol) {
  map(model_list,
      ~ train(
        y ~ . - 1,
        data = data,
        method = .x,
        trControl = timecontrol
      )) %>%
    set_names(model_list)
}

Fit models

model_list <- c("xgbTree", "earth", "cubist")
mods <- foo %>%
  nest(data = -group)

mods <- mods %>%
  mutate(fits = map(
    data,
    ~ fit_models(
      data = .x,
      model_list = model_list,
      timecontrol = timecontrol
    )
  ))

If you want to view the results for a particular group / model you can do:

mods[which(mods$group == "A"), ]$fits[[1]]$xgbTree

Use furrr for parallel processing

Just initialize workers with plan(multisession) and change map() to future_map(). Note you might want fewer workers if your computer has fewer than 6 processing cores.

library(furrr)
plan(multisession, workers = 6)

mods <- foo %>%
  nest(data = -group)

mods <- mods %>%
  mutate(fits = future_map(
    data,
    ~ fit_models(
      data = .x,
      model_list = model_list,
      timecontrol = timecontrol
    )
  ))
Giovanni Colitti
  • As I understood, you are running different models for each group, right? The point is to run one model, differentiating groups by dummies. – Felipe Alvarenga Dec 05 '19 at 18:17
  • So you just want a more elegant way of making `initialWindow` and `horizon` depend on group size? Is the code you provide at the end of your question already giving you the desired results? – Giovanni Colitti Dec 05 '19 at 19:37
  • Do you want to predict each group/period only once during training? – Giovanni Colitti Dec 05 '19 at 19:52
  • None of the above. I want to run 1 model with group dummies, instead of running separate models for each group. These are two very different things. My code at the end does the second option, which I can already do. I need a way to do the first, a single model for all groups, accounting for time dependency – Felipe Alvarenga Dec 05 '19 at 20:18
  • I understand you don't want to train separate models independently by group, which is what I did in my answer. What is wrong with the `timecontrol` you define at the very end? – Giovanni Colitti Dec 05 '19 at 20:25
  • If you want to fit a single model where you use all group data, then why can't you order the data by period and pass this to `train`? The `trainControl` object you create at the end should work just fine and adjust based on different numbers of groups. Since you only use earlier periods to predict later periods, you are accounting for the time component. Can you help me understand why this doesn't work for you? – Giovanni Colitti Dec 05 '19 at 20:41
  • Perhaps what you are getting at is this: When you use the `trainControl` object created at the end of your answer, there is some information leakage because after the first train/test set where you use 115 * 6 obs to predict the next 6 obs, by default the train set adds only the next observation. So now you are predicting period 117 for group A and 116 for groups B-F, where period 116 for group A was added to the training set. So you are using the same period in the training and testing set? Is this what you are getting at? – Giovanni Colitti Dec 05 '19 at 21:30
  • Okay, I think you are looking for the `skip` argument to `trainControl()`. See my latest answer. – Giovanni Colitti Dec 06 '19 at 21:53