Use GroupKFold in nested cross-validation using sklearn

Question

My code is based on the example on the sklearn website: https://scikit-learn.org/stable/auto_examples/model_selection/plot_nested_cross_validation_iris.html

I am trying to use GroupKFold in the inner and outer cv.

from sklearn.datasets import load_iris
from matplotlib import pyplot as plt
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV, cross_val_score, KFold, GroupKFold
import numpy as np

# Load the dataset
iris = load_iris()
X_iris = iris.data
y_iris = iris.target

# Set up possible values of parameters to optimize over
p_grid = {"C": [1, 10, 100],
          "gamma": [.01, .1]}

# We will use a Support Vector Classifier with "rbf" kernel
svm = SVC(kernel="rbf")

# Choose cross-validation techniques for the inner and outer loops,
# independently of the dataset.
# E.g "GroupKFold", "LeaveOneOut", "LeaveOneGroupOut", etc.
inner_cv = GroupKFold(n_splits=3)
outer_cv = GroupKFold(n_splits=3)

# Non_nested parameter search and scoring
clf = GridSearchCV(estimator=svm, param_grid=p_grid, cv=inner_cv)

# Nested CV with parameter optimization
nested_score = cross_val_score(clf, X=X_iris, y=y_iris, cv=outer_cv, groups=y_iris)

I know that the groups argument is not meant to take the y values! With this code I get the following error:

.../anaconda3/lib/python3.7/site-packages/sklearn/model_selection/_validation.py:536: FitFailedWarning: Estimator fit failed. The score on this train-test partition for these parameters will be set to nan. Details: 
ValueError: The 'groups' parameter should not be None.

Does anyone have an idea how to solve this?

Thank you for your help in advance,

Sören

score 4 · Answer 1 · answered Oct 28 '20 at 12:14

I came across a similar problem and found the solution by @Samalama to be a good one. The only thing I needed to change was the fit call: I had to slice the groups too, so that they have the same length as the X and y of the training set. Otherwise, I get an error saying that the shapes of the three objects are not the same. Is that a correct implementation?

for train_index, test_index in outer_cv.split(x, y, groups=groups):
    x_train, x_test = x[train_index], x[test_index]
    y_train, y_test = y[train_index], y[test_index]

    grid = RandomizedSearchCV(estimator=model,
                                param_distributions=parameters_grid,
                                cv=inner_cv,
                                scoring=get_scoring(),
                                refit='roc_auc_scorer',
                                return_train_score=True,
                                verbose=1,
                                n_jobs=jobs)
    grid.fit(x_train, y_train, groups=groups[train_index])
    prediction = grid.predict(x_test)
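
To illustrate why the slicing is needed, here is a minimal sketch on synthetic data. The data set, the six hypothetical groups and the LogisticRegression estimator are assumptions made for this example and are not taken from the answer above. It passes the unsliced groups first, which should be rejected because their length no longer matches the training subset, and then the sliced groups.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, GroupKFold

# Synthetic stand-in data: 60 samples in 6 hypothetical groups of 10.
X, y = make_classification(n_samples=60, n_features=5, random_state=0)
groups = np.repeat(np.arange(6), 10)

outer_cv = GroupKFold(n_splits=3)
inner_cv = GroupKFold(n_splits=2)

# Take only the first outer split for the demonstration.
train_index, test_index = next(outer_cv.split(X, y, groups=groups))

grid = GridSearchCV(LogisticRegression(max_iter=1000), {"C": [1, 10]}, cv=inner_cv)

# Unsliced groups: 60 entries against 40 training samples.
try:
    grid.fit(X[train_index], y[train_index], groups=groups)
    unsliced_failed = False
except ValueError as exc:
    unsliced_failed = True
    print("Unsliced groups rejected:", exc)

# Sliced groups: same length as the training subset.
grid.fit(X[train_index], y[train_index], groups=groups[train_index])
sliced_ok = hasattr(grid, "best_params_")
print("Sliced groups accepted:", sliced_ok)
```

So slicing with groups[train_index], as done in the loop above, is what keeps X, y and groups aligned.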

score 4 · Answer 2 · edited Nov 16 '22 at 14:41

For anyone coming back to this now who, like me, was interested in passing GroupKFold cross-validation into cross_val_score()...

cross_val_score() accepts both cv=GroupKFold() and a groups parameter separately.

This did the trick for what I was trying to achieve.

For example:

cv_outer = GroupKFold(n_splits=n_unique_groups)
groups = X['your_group_name'] # or pass your group another way

.... ML Code ...
    
scores = cross_val_score(search, X, y, scoring='f1', cv=cv_outer, groups = groups)
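
A self-contained sketch of this call, under assumptions: the synthetic data, the six hypothetical groups and the plain LogisticRegression estimator are made up for illustration. Note that it only covers grouping in the outer loop; if the estimator is itself a search object with a grouped inner cv, the inner search needs its own groups as well.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GroupKFold, cross_val_score

# Synthetic stand-in data: 60 samples in 6 hypothetical groups of 10.
X, y = make_classification(n_samples=60, n_features=5, random_state=0)
groups = np.repeat(np.arange(6), 10)

cv_outer = GroupKFold(n_splits=3)

# groups is forwarded to cv_outer.split(), so no group is split across folds.
scores = cross_val_score(
    LogisticRegression(max_iter=1000), X, y, cv=cv_outer, groups=groups
)
print(scores)
```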

score 3 · Answer 3 · answered May 09 '20 at 21:06

I have been trying to implement nested CV with GroupKFold myself, also tried to follow the example provided by sklearn which you refer to and also ended up with the same error as you, finding this thread.

I don't think the answer by ywbaek addressed the problem correctly.

After some searching, I found a few issues on sklearn Github were raised, in relation to either this specific problem or what seem to be other forms of the same problem. I think it has to do with the groups parameter not being propagated to all methods (I tried to track down where in the scripts it failed for me, but quickly got lost).

Here are the issues:

As you can see these date back some time (to Oct 2016). I don't know or understand much about development, but it clearly hasn't been a priority to fix this. I guess that's fine, but the example of nested CV specifically suggests using the method provided with GroupKFold, which is not possible, and should therefore be updated.

If you still want to do a nested CV with GroupKFold, there are of course other ways to do it. An example with logistic regression:

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, GroupKFold

# X, y and group are assumed to be NumPy arrays with one entry per sample
pred_y = []
true_y = []

model = LogisticRegression()
Cs = [1, 10, 100]
p_grid = {'C': Cs}

inner_CV = GroupKFold(n_splits=4)
outer_CV = GroupKFold(n_splits=4)

for train_index, test_index in outer_CV.split(X, y, groups=group):
    X_tr, X_tt = X[train_index, :], X[test_index, :]
    y_tr, y_tt = y[train_index], y[test_index]

    clf = GridSearchCV(estimator=model, param_grid=p_grid, cv=inner_CV)
    # Slice the groups so they match the training subset
    clf.fit(X_tr, y_tr, groups=group[train_index])

    pred = clf.predict(X_tt)
    pred_y.extend(pred)
    true_y.extend(y_tt)

You can then evaluate predictions against truths however you like. Of course if you're still interested in comparing nested and un-nested scores, you can also collect unnested scores which I haven't done here.
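
As one possible way to do that evaluation, here is a runnable sketch of the whole loop on synthetic data, scored with accuracy_score. The data, the 12 hypothetical groups and the choice of accuracy are assumptions for illustration, not part of the original answer.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, GroupKFold

# Synthetic stand-in data: 120 samples in 12 hypothetical groups of 10.
X, y = make_classification(n_samples=120, n_features=5, random_state=0)
group = np.repeat(np.arange(12), 10)

inner_cv = GroupKFold(n_splits=3)
outer_cv = GroupKFold(n_splits=4)
p_grid = {"C": [1, 10, 100]}

pred_y, true_y = [], []
for train_index, test_index in outer_cv.split(X, y, groups=group):
    clf = GridSearchCV(LogisticRegression(max_iter=1000), param_grid=p_grid, cv=inner_cv)
    # The inner search only sees the training subset, groups included.
    clf.fit(X[train_index], y[train_index], groups=group[train_index])
    pred_y.extend(clf.predict(X[test_index]))
    true_y.extend(y[test_index])

# Pooled accuracy over all outer test folds.
nested_accuracy = accuracy_score(true_y, pred_y)
print(nested_accuracy)
```

Pooling predictions gives a single score; collecting one score per outer fold instead would be closer to what cross_val_score returns.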

score 1 · Answer 4 · answered Mar 30 '21 at 17:49

OK, so a simple solution for a nested CV is to use the fit_params parameter of cross_val_score:

nested_score = cross_val_score(clf, X=X_iris, y=y_iris, cv=outer_cv, groups=y_iris, fit_params={"groups": y_iris})

This will push down the groups into the GridSearchCV. However, what you are doing will still raise a bunch of exceptions due to some conceptual issues you have with your approach (this somewhat extends and complements @ywbaek 's answer). Let's check those out:

So, when you do a GroupKFold it will make sure that all samples from one group will be either in training or in test. You are setting those groups to the three target classes in the iris dataset ([0,1,2]).

That means the outer_cv (with n_splits=3) will create a fold with two classes in the training and the remaining class in the test.
```
for train_idx, test_idx in outer_cv.split(X_iris, y_iris, groups=y_iris):
    print(np.unique(y_iris[test_idx]))
```
This does not really make sense, since the model never sees the class it is tested on. But let's go on for a moment:
In the inner_cv we then only have two classes, which will always break GroupKFold(n_splits=3), since there are only two possible groups.
So let's set inner_cv to GroupKFold(n_splits=2) for a moment. This fixes the previous issue, but then we only have one class in training and one class in test. In this case, the classifier will complain that there is only one class in the training data and that it cannot learn anything.

So overall, while the solution above based on the fit_params parameter allows you to do a nested cross validation, it does not solve the conceptual issue you have with your approach. I hope my explanation helped to make that a little clearer.
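
The conceptual issue described above can be checked directly. This sketch (variable names such as fold_classes are made up for illustration) uses the iris targets as groups and records which classes end up in training and test for each outer fold, then tries a three-way grouped split on one training subset.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GroupKFold

X_iris, y_iris = load_iris(return_X_y=True)
outer_cv = GroupKFold(n_splits=3)

# With groups=y_iris, every outer test fold holds a class absent from training.
fold_classes = []
for train_idx, test_idx in outer_cv.split(X_iris, y_iris, groups=y_iris):
    train_classes = set(y_iris[train_idx].tolist())
    test_classes = set(y_iris[test_idx].tolist())
    fold_classes.append((train_classes, test_classes))
    print("train:", sorted(train_classes), "test:", sorted(test_classes))

# The training part of an outer fold has only two groups left,
# so a grouped 3-fold split of it cannot be built.
train_idx, _ = next(outer_cv.split(X_iris, y_iris, groups=y_iris))
try:
    list(GroupKFold(n_splits=3).split(
        X_iris[train_idx], y_iris[train_idx], groups=y_iris[train_idx]))
    inner_split_failed = False
except ValueError as exc:
    inner_split_failed = True
    print("Inner split rejected:", exc)
```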

ywbaek · Answer 5 · 2020-04-02T17:59:12.320

As you can see from the documentation for GroupKFold,
you use it when you want to have non-overlapping groups for K-fold.
It means that unless you have distinct groups of data that need to be separated when creating a K-fold, you don't use this method.

That being said, for the given example, you have to manually create groups,
which should be an array like object with the same shape as your y.
And the number of distinct groups has to be at least equal to the number of folds.

The following is the example code from the documentation:

import numpy as np
from sklearn.model_selection import GroupKFold
X = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])
y = np.array([1, 2, 3, 4])
groups = np.array([0, 0, 2, 2])
group_kfold = GroupKFold(n_splits=2)
group_kfold.get_n_splits(X, y, groups)

You can see that groups has the same shape as y,
and it has two distinct groups 0, 2 which is the same as the number of folds.
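
To see what the splitter does with these groups, the same arrays can be run through split(). This is a small sketch; the bookkeeping list test_groups_per_fold is a name introduced here for illustration.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

X = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])
y = np.array([1, 2, 3, 4])
groups = np.array([0, 0, 2, 2])

group_kfold = GroupKFold(n_splits=2)

test_groups_per_fold = []
for train_index, test_index in group_kfold.split(X, y, groups):
    train_groups = set(groups[train_index].tolist())
    test_groups = set(groups[test_index].tolist())
    # A group never appears on both sides of a split.
    assert not (train_groups & test_groups)
    test_groups_per_fold.append(test_groups)
    print("train groups:", train_groups, "test groups:", test_groups)
```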

EDITED:
get_n_splits(groups) method of GroupKFold object returns the number of splitting iterations in the cross-validator, which we can pass in as an argument for cv keyword to cross_val_score function.

clf = GridSearchCV(estimator=svm, 
                   param_grid=p_grid, 
                   cv=inner_cv.get_n_splits(groups=y_iris))

nested_score = cross_val_score(clf, X=X_iris, y=y_iris, 
                               cv=outer_cv.get_n_splits(groups=y_iris))

Thanks, that's why I used y for the example. Your example makes it more clear, but does not solve the exception in the nested cross-validation. - Sören Etler, Apr 02 '20 at 17:39
I don't see how the edited solution here works. `cv=outer_cv.get_n_splits(groups=y_iris)` means just an int will be passed to `cv`, so you'll end up getting regular cv-splitting, not grouped. - bernie, Nov 04 '20 at 20:05
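
The point made in the last comment can be verified with a short sketch. It uses check_cv from sklearn.model_selection to show which splitter an integer cv resolves to for a classifier; the variable names are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GroupKFold, StratifiedKFold, check_cv

X_iris, y_iris = load_iris(return_X_y=True)

# get_n_splits only reports the number of folds; the groups are not kept.
n = GroupKFold(n_splits=3).get_n_splits(groups=y_iris)
print(type(n), n)

# An integer cv for a classifier is turned into a non-grouped splitter.
resolved = check_cv(n, y_iris, classifier=True)
print(type(resolved).__name__)
```

So passing the result of get_n_splits as cv silently drops the grouping.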

cbarts · Answer 6 · 2022-10-28T06:12:23.560

For latecomers to this party, this technique doesn't require changes to the sklearn objects (e.g. LogisticRegressionCV, GridSearchCV) and works transparently in multiple sklearn settings.

I mock up a cv object that incorporates the groups:

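class PseudoGroupCV:
    def __init__(self, cv_obj, groups):
        self.cv = cv_obj
        self.groups = groups
    def split(self, X, y=None, groups=None):
        # Ignore the groups passed by the caller and use the stored ones
        return self.cv.split(X, y, groups=self.groups)
    def get_n_splits(self, X=None, y=None, groups=None):
        # Use the stored groups here too (needed by e.g. LeaveOneGroupOut)
        return self.cv.get_n_splits(X, y, self.groups)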

You can then pass it into e.g. GridSearchCV like this:

kfold = GroupKFold(n_splits=5) # desired CV object
clf = GridSearchCV(estimator=svm, 
                    param_grid=p_grid, 
                    cv=PseudoGroupCV(kfold, groups)
                   )

This should then work as normal. It also works for Pipeline objects.

The only downside is that you need to provide the groups when instantiating the class (i.e. groups needs to match the (X, y) used for fitting).
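
Here is a self-contained sketch of the wrapper in use, on synthetic data with six hypothetical groups (both assumptions for illustration). It uses a variant of the class whose get_n_splits also relies on the stored groups, and it fits GridSearchCV without any groups argument.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, GroupKFold
from sklearn.svm import SVC


class PseudoGroupCV:
    """Wrap a grouped splitter together with a fixed groups array."""

    def __init__(self, cv_obj, groups):
        self.cv = cv_obj
        self.groups = groups

    def split(self, X, y=None, groups=None):
        # The caller's groups are ignored in favour of the stored ones.
        return self.cv.split(X, y, groups=self.groups)

    def get_n_splits(self, X=None, y=None, groups=None):
        return self.cv.get_n_splits(X, y, self.groups)


# Synthetic stand-in data: 60 samples in 6 hypothetical groups of 10.
X, y = make_classification(n_samples=60, n_features=5, random_state=0)
groups = np.repeat(np.arange(6), 10)

cv = PseudoGroupCV(GroupKFold(n_splits=3), groups)

# Check that the wrapped splitter keeps groups disjoint.
splits_disjoint = all(
    not (set(groups[tr].tolist()) & set(groups[te].tolist()))
    for tr, te in cv.split(X, y)
)

search = GridSearchCV(SVC(kernel="rbf"), {"C": [1, 10]}, cv=cv)
search.fit(X, y)  # no groups argument needed
print(search.best_params_, search.n_splits_)
```

Because the groups are fixed at construction, this fits the non-nested case; inside an outer loop the stored groups would have to be rebuilt for each training subset.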

score 0 · Answer 7 · answered Feb 19 '22 at 16:10

The answer by @Martin Becker is correct. When GridSearchCV is used with GroupKFold, its fit method expects groups in addition to X and y. To pass that argument through, use the fit_params parameter of the cross_val_score function.

Here is an example. To keep it simple I replaced GroupKFold with LeaveOneGroupOut.

import numpy as np
from sklearn.base import BaseEstimator
from sklearn.model_selection import \
    LeaveOneGroupOut, cross_val_score, GridSearchCV

# Create 12 samples and 4 groups [0, 1, 2] [3, 4, 5], ...
X = np.arange(12)
y = np.random.randint(0, 2, len(X))
groups = X // 3

class DummyEstimator(BaseEstimator):
    """Estimator that just prints given folds."""
    def fit(self, X, y=None):
        print('Trained on', np.unique(X // 3))
        return self
    def score(self, X, y):
        print('Tested on', np.unique(X // 3))
        return 0

logo = LeaveOneGroupOut()
clf = GridSearchCV(DummyEstimator(), param_grid={}, cv=logo)
cross_val_score(
    clf, X, y, 
    cv=logo, groups=groups, fit_params={'groups': groups},
    n_jobs=None)

The code results in the following training/validation/test groups:

Trained on [2 3]  <-- First inner loop (Test fold=0, Train=1, 2, 3)
Tested on  [1]
Trained on [1 3]
Tested on  [2]
Trained on [1 2]
Tested on  [3]
Trained on [1 2 3]  <-- fit best params on the whole training data
Tested on  [0]      <-- Score on the test fold 0
Trained on [2 3]  <-- Second inner loop (Test fold=1, Train=0 2 3)
Tested on  [0]
Trained on [0 3]
Tested on  [2]
Trained on [0 2]
Tested on  [3]
Trained on [0 2 3]  <-- fit best params on the whole training data
Tested on  [1]      <-- Score on the test fold 1
... and so on
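
The same idea with a real estimator can be sketched as follows. The synthetic data and the 12 hypothetical groups are assumptions. To my knowledge, newer scikit-learn releases accept this dictionary under the keyword params instead of fit_params, so the sketch inspects the signature of cross_val_score and picks whichever name is available; treat that detail as an assumption to check against your installed version.

```python
import inspect

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, GroupKFold, cross_val_score

# Synthetic stand-in data: 120 samples in 12 hypothetical groups of 10.
X, y = make_classification(n_samples=120, n_features=5, random_state=0)
groups = np.repeat(np.arange(12), 10)

inner_cv = GroupKFold(n_splits=3)
outer_cv = GroupKFold(n_splits=4)
clf = GridSearchCV(LogisticRegression(max_iter=1000), {"C": [1, 10, 100]}, cv=inner_cv)

# Pick the keyword this scikit-learn version offers for extra fit arguments.
signature = inspect.signature(cross_val_score)
kw = "params" if "params" in signature.parameters else "fit_params"

# groups= drives the outer split; the dictionary is forwarded to clf.fit,
# sliced to the outer training indices, and drives the inner split.
nested_scores = cross_val_score(
    clf, X, y, cv=outer_cv, groups=groups, **{kw: {"groups": groups}}
)
print(nested_scores)
```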

score 0 · Answer 8 · answered May 20 '22 at 12:54

It also works with RFECV.

To sum up, pass the GroupKFold to RFECV, and pass the group ids to the fit method.

from sklearn.ensemble import GradientBoostingRegressor
from sklearn.feature_selection import RFECV
from sklearn.model_selection import GroupKFold

cv_outer = GroupKFold(n_splits=5)

# One group id per row of X_train
groups = df_train_data['group_id']

estimator = GradientBoostingRegressor(verbose=1)
selector = RFECV(estimator, step=1, cv=cv_outer, n_jobs=-1, verbose=1)
selector = selector.fit(X_train, y_train, groups=groups)

print(selector.support_)
print(selector.ranking_)
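
A runnable variant on synthetic data, as a sketch: the regression data, the ten hypothetical groups and the swap to LinearRegression (chosen only to keep the run fast) are assumptions and not part of the answer above.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GroupKFold

# Synthetic stand-in data: 100 samples in 10 hypothetical groups of 10.
X, y = make_regression(n_samples=100, n_features=6, n_informative=3, random_state=0)
groups = np.repeat(np.arange(10), 10)

selector = RFECV(LinearRegression(), step=1, cv=GroupKFold(n_splits=5))
# The groups are handed to the splitter through fit.
selector.fit(X, y, groups=groups)

print(selector.support_)
print(selector.ranking_)
```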
