I'm trying to train a model on a panel of different units over time. I understand how to use createTimeSlices from the caret package, but I'd like to use the same process while also holding out different units in different training folds. Using the example data below, one such fold would train on time periods ("t") 1 and 2 for units A and B and test on time period 3 for units A, B, and C. In terms of row indices, that means training on rows 1, 2, 6, and 7 and testing on rows 3, 8, and 12. In this particular fold, unit C is held out, but ideally I'd be able to create a set of folds that holds out each unit in turn for each time window as the window moves forward in time.
For example, another fold that leaves out unit A and predicts t = 5 would train on rows (assuming a fixed window of 2 and a horizon of 1) 8, 9, 12, and 13 and test on rows 5, 10, and 14.
I do not want a given testing time period to be used in the corresponding training fold.
I should also note that, as in the example below, not every unit has the same number of observed time periods, and the observed time periods for each unit are not necessarily the same. In these cases, I would prefer to not throw out any data, but if it's necessary, I'm ok with balancing the panel.
I have not found anything built into caret to do this type of cross-validation. I've seen one related question that uses the skip argument in createTimeSlices, but I don't think it will work for my question, as it seems to require a balanced panel and does not allow for holding out units.
Example data:
library(tibble)  # provides tibble()

set.seed(123)
data <- tibble(
  unit = rep(c('A', 'B', 'C'), times = c(5, 5, 6)),
  t = c(1:5, 1:5, 2:7),
  y = rnorm(16),
  x1 = rnorm(16, mean = 10, sd = 2),
  x2 = runif(16, min = 5, max = 10)
)
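To make the target concrete, this is roughly the set of folds I am after, built by hand in base R. It is only a sketch: make_panel_folds and its arguments are names I made up (nothing here is part of caret), and it assumes that the training window is defined in calendar time, that a fold is kept only when every period in the window is observed somewhere in the panel, and that the held-out unit must be observed in the test period.

```r
# Stand-in for the example panel: only the columns the folds depend on.
panel <- data.frame(
  unit = rep(c("A", "B", "C"), times = c(5, 5, 6)),
  t    = c(1:5, 1:5, 2:7),
  stringsAsFactors = FALSE
)

# Hypothetical helper: one fold per (test period, held-out unit) pair.
make_panel_folds <- function(unit, t, window = 2, horizon = 1) {
  units <- sort(unique(unit))
  times <- sort(unique(t))
  train <- list()
  test  <- list()
  for (test_t in times) {
    # The training window ends `horizon` periods before the test period.
    train_t <- seq(test_t - horizon - window + 1, test_t - horizon)
    # Assumption: skip windows that reach outside the observed periods.
    if (!all(train_t %in% times)) next
    te <- which(t == test_t)
    for (held_out in units) {
      tr <- which(t %in% train_t & unit != held_out)
      # Assumption: keep the fold only if there is something to train on
      # and the held-out unit is actually observed in the test period.
      if (length(tr) == 0 || !any(unit[te] == held_out)) next
      fold_name <- paste0("t", test_t, "_no", held_out)
      train[[fold_name]] <- tr
      test[[fold_name]]  <- te
    }
  }
  list(index = train, indexOut = test)
}

folds <- make_panel_folds(panel$unit, panel$t)

folds$index$t3_noC     # 1 2 6 7
folds$indexOut$t3_noC  # 3 8 12
folds$index$t5_noA     # 8 9 12 13
folds$indexOut$t5_noA  # 5 10 14
```

Because the test rows are always taken from a period after the training window, a testing period never appears in its own training fold, and no rows are dropped to balance the panel. As far as I can tell, lists like these could be passed to trainControl via its index and indexOut arguments, but I have not confirmed how that behaves with an unbalanced panel.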