I'm trying to run out-of-sample validation on a time-series dataset, using scikit-learn's TimeSeriesSplit() to create the train/test folds.

The idea is to fit statsmodels' SARIMAX on each train fold and then validate on the matching test fold without refitting the model. To do that, I append the observations from the test fold to the fitted model one at a time, forecasting one step ahead after each append.

However, I get a ValueError on that append step: ValueError: Given `endog` does not have an index that extends the index of the model.

This makes no sense to me. If I run print(max(train_fold.index), min(test_fold.index)) for each fold, the last index of the train fold clearly comes immediately before the first index of the test fold. In my case:

1983-05 1983-06
1984-05 1984-06
1985-05 1985-06
1986-05 1986-06
1987-05 1987-06

Here's the full code as it currently stands. I'm sure I'm doing something silly, but I am stuck:

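import statsmodels.api as sm
from sklearn.model_selection import TimeSeriesSplit

# Create a generator that yields the indices of our train and test folds
# (train_series is a pandas Series with a monthly index)
split = TimeSeriesSplit(n_splits=5).split(train_series)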

# Loop through each fold
for train_idcs, test_idcs in split:

    # Create an empty prediction list to append to
    predictions = []

    # Create the folds (the split yields positional indices, so use .iloc)
    train_fold = train_series.iloc[train_idcs]
    test_fold = train_series.iloc[test_idcs]

    # Fit the model on the training fold
    model_instance = sm.tsa.statespace.SARIMAX(
        train_fold,
        order=(1, 0, 0),
        seasonal_order=(1, 0, 0, 12),
        simple_differencing=True,
        enforce_stationarity=False,
        enforce_invertibility=False,
    )
    model_fitted = model_instance.fit(disp=False)

    # Create the initial prediction; .iloc[0] extracts the forecast value only
    pred = model_fitted.forecast(steps=1).iloc[0]
    predictions.append(pred)

    # Now loop through the test set, adding observations individually,
    # and getting the next prediction
    for i in range(len(test_fold)):

        # Get the next row as a one-element Series (which statsmodels expects)
        next_row = test_fold.iloc[i : i + 1]

        # Append the row to the model
        model_fitted.append(next_row, refit=False)

        # Get the new prediction; .iloc[0] extracts the forecast value only
        pred = model_fitted.forecast(steps=1).iloc[0]
        predictions.append(pred)

    print(predictions)

The model_fitted.append(next_row, refit=False) call is the failure point. Any ideas? Thanks!

Laurent
Josh

1 Answer

Got it! It was silly.

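The .append() method of the fitted SARIMAX results object does not modify that object in place. It returns a new results object that includes the appended data. Because I discarded the return value, model_fitted still ended at the last training observation, so from the second test row onward the new observation no longer directly followed the model's index, which is what the ValueError is complaining about.

So the correct code is simply: model_fitted = model_fitted.append(next_row, refit=False)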

Josh