
I have a dataset containing sales data for different products over time. The dataset includes a "Time" column representing the date and a "Product" column specifying the product ID. As multiple products can be sold on the same date, the "Time" column does not have unique values.

I am trying to perform cross-validation on this time series data to train an ML model using the expanding window approach: the training window grows iteratively over time, and the observations that come after it form the test set.
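For illustration, here is a minimal sketch (not my actual code) of the folds I expect on a toy set of dates; the training window grows by one date per fold and all later dates form the test fold:

import numpy as np

# Toy example: 7 unique, sorted dates and a minimum training size of 3.
dates = np.array(['2023-01-01', '2023-01-02', '2023-01-03', '2023-01-04',
                  '2023-01-05', '2023-01-06', '2023-01-07'], dtype='datetime64[D]')
min_train_size = 3

for i in range(min_train_size, len(dates)):
    train_dates = dates[:i]   # everything up to, but not including, date i
    test_dates = dates[i:]    # all remaining, later dates
    print(f"fold {i - min_train_size}: train={train_dates}, test={test_dates}")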

I implemented a custom cross-validation class ExpandingWindowCV based on sklearn.model_selection.BaseCrossValidator. The class takes parameters for the minimum training set size, shuffle option, and random state. The _iter_test_indices method is responsible for generating the test indices based on the expanding window approach.

However, when I applied this custom cross-validator with RandomizedSearchCV for hyperparameter tuning, I noticed that the resulting train and test sets overlapped, which is incorrect for time series data.

I attempted to debug the issue by printing the train and test dates and indices within the _iter_test_indices method. Surprisingly, the printed output showed that the train and test sets were correctly split without any overlap.

import numpy as np
from sklearn.model_selection import BaseCrossValidator
from sklearn.utils import check_random_state


class ExpandingWindowCV(BaseCrossValidator):
    def __init__(self, min_train_size=1, shuffle=False, random_state=None):
        super().__init__()
        self.min_train_size = min_train_size
        self.shuffle = shuffle
        self.random_state = random_state

    def _iter_test_indices(self, X=None, y=None, groups=None):
        # np.unique returns the sorted unique dates; 'Time' is the date column.
        unique_dates = np.unique(X['Time'])
        n_dates = len(unique_dates)
        print(f"X \n {X[['Time', 'Product']]}")

        if self.shuffle:
            rng = check_random_state(self.random_state)
            rng.shuffle(unique_dates)

        for i in range(self.min_train_size, n_dates):
            # Expanding window: the first i dates are used for training,
            # all later dates form the test fold (the two slices are disjoint).
            train_dates = unique_dates[:i]
            test_dates = unique_dates[i:]

            train_indices = np.where(X['Time'].isin(train_dates))[0]
            test_indices = np.where(X['Time'].isin(test_dates))[0]

            # Debug output: the splits look correct (no overlap) when printed here.
            print(f"train_dates {train_dates}")
            print(f"test_dates {test_dates}")
            print(f"train_indices \n {train_indices}")
            print(f"test_indices \n {test_indices}")
            print("train set", X.loc[X.Time.isin(train_dates), ['Time', 'Product']])
            print("test set", X.loc[X.Time.isin(test_dates), ['Time', 'Product']])
            yield test_indices

    def get_n_splits(self, X=None, y=None, groups=None):
        unique_dates = np.unique(X['Time'])
        return len(unique_dates) - self.min_train_size

To investigate further, I printed the train and test dates outside of the custom cross-validator, using the cv object passed to RandomizedSearchCV. There, however, the train and test sets did have overlapping dates, contradicting the earlier prints.

tscv = ExpandingWindowCV(min_train_size=5, shuffle=False, random_state=42)
search = RandomizedSearchCV(pipeline, search_space, cv=tscv, scoring=custom_scorer,
                            error_score='raise', n_jobs=2, n_iter=1, refit=True, verbose=0)

# Inspect the splits produced by the CV object passed to RandomizedSearchCV.
for train, test in search.cv.split(X):
    print('TRAIN: ', train, ' TEST: ', test)
    print(f"  Train: index= \n{train} \n  values= \n{X.loc[X.index.isin(train), ['Time', 'Product']].sort_values('Time')}")
    print(f"  Test: index= \n{test} \n  values= \n{X.loc[X.index.isin(test), ['Time', 'Product']].sort_values('Time')}")
    print(f" Train min= \n{X.loc[X.index.isin(train), 'Time'].min()}")
    print(f" Train max= \n{X.loc[X.index.isin(train), 'Time'].max()}")
    print(f" Test min= \n{X.loc[X.index.isin(test), 'Time'].min()}")
    print(f" Test max= \n{X.loc[X.index.isin(test), 'Time'].max()}")

I suspect there might be an inconsistency in how the cross-validation object is consumed by RandomizedSearchCV, or somewhere in the hyperparameter tuning process.
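For context, my understanding is that BaseCrossValidator.split derives the training fold as the positional complement of the test indices yielded by _iter_test_indices, roughly like this simplified sketch (not the exact sklearn source):

import numpy as np

# Simplified view of how BaseCrossValidator.split pairs train/test indices:
# the test indices yielded by _iter_test_indices become a boolean mask over
# positions 0..n_samples-1, and the training fold is everything else.
def split_sketch(cv, X):
    n_samples = len(X)
    indices = np.arange(n_samples)
    for test_indices in cv._iter_test_indices(X):
        test_mask = np.zeros(n_samples, dtype=bool)
        test_mask[test_indices] = True
        yield indices[~test_mask], indices[test_mask]

So both the yielded test indices and the derived train indices are positional, not DataFrame index labels.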

I would appreciate any insights or suggestions to resolve this issue and properly perform cross-validation on my time series data.

  • Just out of interest: Why aren't you using `sklearn.model_selection.TimeSeriesSplit`? – DataJanitor Jun 27 '23 at 09:22
  • I will need to customize the folds so that certain products appear only in the test fold, etc. Before proceeding further, I need to resolve this problem. – devcloud Jun 27 '23 at 12:04

0 Answers