Nested cross-validation with GroupKFold with sklearn

Question

In my data, several entries correspond to a single subject, and I don't want to mix those entries between the train and the test set. For this reason, I looked at the GroupKFold fold iterator, which according to the sklearn documentation is a "K-fold iterator variant with non-overlapping groups." Therefore, I would like to implement nested cross-validation using GroupKFold to split the train and test sets.

I started from the template given in this question. However, calling the fit method on the grid instance raised an error saying that groups does not have the same shape as x and y. To solve that, I sliced groups too, using the train index.

Is this implementation correct? I mostly care about not mixing data from the same groups between train and test set.

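from sklearn.model_selection import GroupKFold, RandomizedSearchCV

# x, y and groups are NumPy arrays of the same length; model, parameters_grid,
# get_scoring, inner_fold, out_fold and jobs are defined elsewhere.
inner_cv = GroupKFold(n_splits=inner_fold)
outer_cv = GroupKFold(n_splits=out_fold)

for train_index, test_index in outer_cv.split(x, y, groups=groups):
    x_train, x_test = x[train_index], x[test_index]
    y_train, y_test = y[train_index], y[test_index]

    grid = RandomizedSearchCV(estimator=model,
                              param_distributions=parameters_grid,
                              cv=inner_cv,
                              scoring=get_scoring(),
                              refit='roc_auc_scorer',
                              return_train_score=True,
                              verbose=1,
                              n_jobs=jobs)
    # The inner search only sees the training rows, so groups is sliced to match.
    grid.fit(x_train, y_train, groups=groups[train_index])
    prediction = grid.predict(x_test)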

The question you linked seems incorrect. It's a javascript question. — bernie, Nov 05 '20 at 01:51

bernie · Answer 1 · 2020-11-05T02:16:11.493

One way to confirm that the code does what you intend (i.e. does not mix data between groups) is to pass RandomizedSearchCV the output of GroupKFold.split (the indices) instead of the GroupKFold object itself, e.g.

grid = RandomizedSearchCV(estimator=model,
                            param_distributions=parameters_grid,
                            cv=inner_cv.split(
                              x_train, y_train, groups=groups[train_index]),
                            scoring=get_scoring(),
                            refit='roc_auc_scorer',
                            return_train_score=True,
                            verbose=1,
                            n_jobs=jobs)
grid.fit(x_train, y_train)

I believe this leads to the same fitting result, and here you've explicitly given the indices of training/validation for each fold of the cross-validation.

As far as I can see, these two ways of doing it are equivalent, but I think the way your example is written is more elegant since you aren't providing x_train and y_train twice.
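If you want to check that equivalence rather than take my word for it, here is a minimal sketch on toy data. Everything in it is an assumption made for illustration (the synthetic x, y and groups, the LogisticRegression estimator, the small C grid and the default scoring); it is not your actual model or scoring setup. It runs the search once with the splitter plus groups, once with precomputed splits, and compares the cross-validated scores.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GroupKFold, RandomizedSearchCV

# Toy data (assumed): 8 hypothetical subjects with 10 rows each.
rng = np.random.default_rng(0)
groups = np.repeat(np.arange(8), 10)
y = np.tile([0, 1], 40)                      # both classes present in every group
x = rng.normal(size=(80, 3)) + y[:, None]    # features shifted by class

inner_cv = GroupKFold(n_splits=4)
search_kwargs = dict(
    estimator=LogisticRegression(),
    param_distributions={"C": [0.1, 1.0, 10.0]},
    n_iter=3,
    random_state=0,   # same candidates in both searches
)

# Variant A: pass the splitter and hand the groups to fit.
search_a = RandomizedSearchCV(cv=inner_cv, **search_kwargs)
search_a.fit(x, y, groups=groups)

# Variant B: precompute the (train, validation) index pairs and pass those.
splits = list(inner_cv.split(x, y, groups=groups))
search_b = RandomizedSearchCV(cv=splits, **search_kwargs)
search_b.fit(x, y)

same_scores = np.allclose(
    search_a.cv_results_["mean_test_score"],
    search_b.cv_results_["mean_test_score"],
)
print(same_scores)
```

Materialising the splits with list() also lets you inspect the index pairs directly before fitting.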

And it appears correct to slice groups using train_index, since you're only passing the sliced x and y variables to the fit method. I have to remind myself that the inner cross-validation will be doing cross-validation on the training subset of the outer cross-validation operation.
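To convince yourself that no subject ends up on both sides of a split at either level, something like the following sketch can help. The helper name check_nested_group_splits and the toy data are hypothetical, made up for this illustration; the only sklearn call it relies on is GroupKFold.split.

```python
import numpy as np
from sklearn.model_selection import GroupKFold


def check_nested_group_splits(x, y, groups, n_outer=3, n_inner=2):
    """Return True if no group is shared across any outer or inner split."""
    groups = np.asarray(groups)
    outer_cv = GroupKFold(n_splits=n_outer)
    inner_cv = GroupKFold(n_splits=n_inner)
    for train_index, test_index in outer_cv.split(x, y, groups=groups):
        # Outer level: train and test subjects must be disjoint.
        if set(groups[train_index]) & set(groups[test_index]):
            return False
        # Inner level: the indices refer to the sliced training data,
        # so groups has to be sliced with train_index as well.
        x_train, y_train = x[train_index], y[train_index]
        inner_groups = groups[train_index]
        for fit_index, val_index in inner_cv.split(
                x_train, y_train, groups=inner_groups):
            if set(inner_groups[fit_index]) & set(inner_groups[val_index]):
                return False
    return True


# Toy data (assumed): 6 hypothetical subjects with 4 rows each.
rng = np.random.default_rng(0)
groups = np.repeat(np.arange(6), 4)
x = rng.normal(size=(24, 3))
y = rng.integers(0, 2, size=24)

no_leakage = check_nested_group_splits(x, y, groups)
print(no_leakage)
```

Running the same check on your real x, y and groups arrays before the search is a cheap sanity test.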
