
In sklearn, GridSearchCV can take a pipeline as a parameter to find the best estimator through cross-validation. However, the usual cross-validation splits the data like this:

[image: standard k-fold cross-validation splits]

To cross-validate time series data, the training and testing sets are often split like this instead:

[image: walk-forward splits, with each test fold coming after its training fold in time]

That is to say, the testing data should always come later in time than the training data.

My thoughts are:

  1. Write my own k-fold class and pass it to GridSearchCV, so I can enjoy the convenience of a pipeline. The problem is that it seems difficult to make GridSearchCV use specified indices of training and testing data (roughly what I am after is sketched below).

  2. Write a new class GridSearchWalkForwardTest similar to GridSearchCV. I am studying the source code grid_search.py and find it a little complicated.
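
For option 1, this is roughly what I am after: build the (train, test) index pairs myself and hand them to GridSearchCV through its cv argument, which accepts an iterable of index splits. A minimal sketch, assuming a recent sklearn where GridSearchCV lives in model_selection; the window sizes are placeholders and pipeline, param_grid, X, y are whatever is already set up:

import numpy as np
from sklearn.model_selection import GridSearchCV

n = len(X)                                   # X, y assumed ordered by time
splits = []
for test_start in range(100, n, 20):         # placeholder window sizes
    train_index = np.arange(0, test_start)   # everything before the test block
    test_index = np.arange(test_start, min(test_start + 20, n))
    splits.append((train_index, test_index))

search = GridSearchCV(pipeline, param_grid, cv=splits)
search.fit(X, y)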

Any suggestion is welcome.

– PhilChang

6 Answers


I think you could use TimeSeriesSplit() either instead of your own implementation or as a basis for implementing a CV method that works exactly as you describe.

After digging around a bit, it seems someone added a max_train_size option to TimeSeriesSplit() in this PR, which looks like it does what you want.
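
For reference, a minimal sketch of how that plugs into GridSearchCV with a pipeline; the pipeline, grid, number of splits and max_train_size value are only illustrative, and X, y are assumed to be ordered by time:

from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline([('scale', StandardScaler()), ('model', Ridge())])
param_grid = {'model__alpha': [0.1, 1.0, 10.0]}

# each fold trains on at most max_train_size trailing samples
# and tests on the samples that immediately follow them
cv = TimeSeriesSplit(n_splits=5, max_train_size=100)

search = GridSearchCV(pipe, param_grid, cv=cv)
search.fit(X, y)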

– Matthijs Brouns
    you're right, **walk-forward cross-validation** is scikit-learn's **TimeSeriesSplit** algorithm. But how do you select it as the "cv" object in CV estimators like LassoCV and ElasticNetCV? KFold, LeaveOneOut, train_test_split and other algorithms belong to the **cross_validation module** of sklearn, from which we can select a "cv" object for these estimators. However, TimeSeriesSplit belongs to the **model_selection module** of sklearn, not currently making it a choice. – develarist Nov 29 '19 at 14:45
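
(For reference: in recent sklearn versions the cv argument of estimators like LassoCV and ElasticNetCV accepts any splitter instance from model_selection, so a TimeSeriesSplit can be passed in directly. A minimal sketch, with illustrative alphas and X, y assumed ordered by time:)

from sklearn.linear_model import LassoCV
from sklearn.model_selection import TimeSeriesSplit

# any splitter instance can be passed through `cv`
lasso = LassoCV(alphas=[0.01, 0.1, 1.0], cv=TimeSeriesSplit(n_splits=5))
lasso.fit(X, y)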

I did some work regarding all this some months ago.

You could check it out in this question/answer:

Rolling window REVISITED - Adding window rolling quantity as a parameter- Walk Forward Analysis

– Ezarate11

My opinion is that you should try to implement your own GridSearchWalkForwardTest. I once used GridSearchCV to do the training and also implemented the same grid search myself, and I didn't get the same results, even though I should have.

What I did in the end was use my own function. You have more control over the training and test sets, and more control over the parameters you tune.
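
For illustration, a hand-rolled grid search over walk-forward splits might look roughly like this. A sketch only: walk_forward_splits is a hypothetical generator of (train_index, test_index) pairs, X and y are NumPy arrays ordered by time, and the estimator and grid are only illustrative:

from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import ParameterGrid

results = {}
for params in ParameterGrid({'alpha': [0.1, 1.0, 10.0]}):    # illustrative grid
    errors = []
    for train_index, test_index in walk_forward_splits(X):   # hypothetical splitter
        model = Ridge(**params).fit(X[train_index], y[train_index])
        errors.append(mean_squared_error(y[test_index], model.predict(X[test_index])))
    results[tuple(sorted(params.items()))] = sum(errors) / len(errors)

best_params = min(results, key=results.get)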

– hoaphumanoid

I've written some code that I hope could be helpful to someone.

'sequence' is the time index of the series. I train a model on sequences up to 40 and predict 41, then train up to 41 to predict 42, and so on, up until the max. 'quantity' is the target variable. The average of all of the errors is then my evaluation metric:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

RMSE = []
for sequence in range(40, df.sequence.max() + 1):
    train = df[df['sequence'] < sequence]
    test = df[df['sequence'] == sequence]
    X_train, X_test = train.drop(['quantity'], axis=1), test.drop(['quantity'], axis=1)
    y_train, y_test = train['quantity'].values, test['quantity'].values

    mdl = LinearRegression()
    mdl.fit(X_train, y_train)
    y_pred = mdl.predict(X_test)
    error = np.sqrt(mean_squared_error(y_test, y_pred))  # RMSE for this window
    RMSE.append(error)
print('Mean RMSE = %.5f' % np.mean(RMSE))

– addi wei

Leveraging sklearn's TimeSeriesSplit, this defines fixed-size rolling training and test windows. Note that the first training window may include additional excess data (I prefer to keep it rather than clip it):

import math
from sklearn.model_selection import TimeSeriesSplit

def tscv(X, train_size, test_size):
    # as many folds as full test windows fit into the data
    folds = math.floor(len(X) / test_size)
    tscv = TimeSeriesSplit(n_splits=folds, test_size=test_size)
    splits = []
    for train_index, test_index in tscv.split(X):
        if len(train_index) < train_size:
            # not enough history yet for a full training window: skip this fold
            continue
        elif len(train_index) - train_size < test_size and len(train_index) - train_size > 0:
            # keep the small excess at the start rather than clipping it
            pass
        else:
            # trim to a fixed-size rolling training window
            train_index = train_index[-train_size:]
        splits.append([train_index, test_index])
    return splits
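
A possible usage sketch, feeding the returned splits straight into GridSearchCV; the estimator, grid, and window sizes are only illustrative, and X, y are assumed to be ordered by time:

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

splits = tscv(X, train_size=200, test_size=50)    # illustrative window sizes
param_grid = {'n_estimators': [100, 300]}         # illustrative grid
search = GridSearchCV(RandomForestRegressor(), param_grid, cv=splits)
search.fit(X, y)
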
– John Richardson

I use this custom class to create disjoint splits based on StratifiedKFold (could be replaced by KFold or others), in order to create the following training scheme:

|X||V|O|O|O|
|O|X||V|O|O|
|O|O|X||V|O|
|O|O|O|X||V|

X / V are the training / validation sets. "||" indicates a gap (parameter n_gap: int > 0) truncated from the beginning of the validation set, in order to prevent leakage effects.

You could easily extend it to get longer lookback windows for the training sets.

from sklearn.model_selection import StratifiedKFold

class StratifiedWalkForward(object):

    def __init__(self, n_splits, n_gap):
        self.n_splits = n_splits
        self.n_gap = n_gap
        # n_splits + 1 consecutive blocks yield n_splits (train, validation) pairs
        self._cv = StratifiedKFold(n_splits=self.n_splits + 1, shuffle=False)

    def split(self, X, y, groups=None):
        splits = self._cv.split(X, y)
        _ixs = []
        for ix in splits:
            _ixs.append(ix[1])  # keep only the test block of each fold
        for i in range(1, len(_ixs)):
            # train on block i-1, validate on block i minus its leading n_gap samples
            yield tuple((_ixs[i - 1], _ixs[i][_ixs[i] > _ixs[i - 1][-1] + self.n_gap]))

    def get_n_splits(self, X, y, groups=None):
        return self.n_splits

Note that the datasets may not be perfectly stratified afterwards, because of the truncation with n_gap.
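
A possible usage sketch with GridSearchCV; the estimator, grid, and parameter values are only illustrative, and y is assumed to be a class label, since StratifiedKFold stratifies on it:

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

cv = StratifiedWalkForward(n_splits=5, n_gap=10)   # illustrative values
param_grid = {'C': [0.1, 1.0, 10.0]}               # illustrative grid
search = GridSearchCV(LogisticRegression(), param_grid, cv=cv)
search.fit(X, y)                                   # X, y assumed ordered by time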

– user101893