I'm trying to run out-of-sample validation on a time-series dataset, using scikit-learn's TimeSeriesSplit() to create train/test folds.
The idea is to train statsmodels' SARIMAX on each train fold and then validate on the corresponding test fold without refitting the model. To do that, we must append new observations from the test fold to the model one at a time before each prediction.
However, I get a ValueError on that append step:
ValueError: Given `endog` does not have an index that extends the index of the model.
This makes no sense to me. If I run print(max(train_fold.index), min(test_fold.index)) for each fold, the last index of the train fold is clearly lower than the first index of the test fold. In my case:
1983-05 1983-06
1984-05 1984-06
1985-05 1985-06
1986-05 1986-06
1987-05 1987-06
Here's the full code as it currently stands. I'm sure I'm doing something silly, but I am stuck:
# Imports the snippet relies on
import statsmodels.api as sm
from sklearn.model_selection import TimeSeriesSplit

# Create a generator that yields the indices of our train and test folds
split = TimeSeriesSplit(n_splits=5).split(train_series)

# Loop through each fold
for train_idcs, test_idcs in split:
    # Create an empty prediction list to append to
    predictions = []

    # Create the folds (positional selection, since the split yields integer positions)
    train_fold = train_series.iloc[train_idcs]
    test_fold = train_series.iloc[test_idcs]

    # Fit the model on the training fold
    model_instance = sm.tsa.statespace.SARIMAX(
        train_fold,
        order=(1, 0, 0),
        seasonal_order=(1, 0, 0, 12),
        simple_differencing=True,
        enforce_stationarity=False,
        enforce_invertibility=False,
    )
    model_fitted = model_instance.fit(disp=False)

    # Create the initial prediction, keeping just the forecast value
    pred = model_fitted.forecast(steps=1).iloc[0]
    predictions.append(pred)

    # Now loop through the test set, adding observations individually,
    # and getting the next prediction
    for i in range(len(test_fold)):
        # Get the next row as a single-row Series (which statsmodels expects)
        next_row = test_fold.iloc[i : i + 1]

        # Append the row to the model (this is the line that raises the ValueError)
        model_fitted.append(next_row, refit=False)

        # Get the new prediction, keeping just the forecast value
        pred = model_fitted.forecast(steps=1).iloc[0]
        predictions.append(pred)

    print(predictions)
The model_fitted.append(next_row, refit=False) call is the failure point. Any ideas? Thanks!