In my data, several entries correspond to a single subject, and I don't want to mix those entries between the train and the test set. For this reason, I looked at the GroupKFold fold iterator, which according to the sklearn documentation is a "K-fold iterator variant with non-overlapping groups."
Therefore, I would like to implement nested cross-validation using GroupKFold to split the data into train and test sets.
I started from the template given in this question. However, calling the fit method on the grid instance raised an error saying that groups does not have the same shape as X and y. To solve that, I sliced groups too, using the train index.
Is this implementation correct? I mostly care about not mixing data from the same groups between train and test set.
from sklearn.model_selection import GroupKFold, RandomizedSearchCV

inner_cv = GroupKFold(n_splits=inner_fold)
outer_cv = GroupKFold(n_splits=out_fold)
for train_index, test_index in outer_cv.split(x, y, groups=groups):
    x_train, x_test = x[train_index], x[test_index]
    y_train, y_test = y[train_index], y[test_index]
    grid = RandomizedSearchCV(estimator=model,
                              param_distributions=parameters_grid,
                              cv=inner_cv,
                              scoring=get_scoring(),
                              refit='roc_auc_scorer',
                              return_train_score=True,
                              verbose=1,
                              n_jobs=jobs)
    # Pass only the group labels of the training rows to the inner search
    grid.fit(x_train, y_train, groups=groups[train_index])
    prediction = grid.predict(x_test)
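One way I could sanity-check the "no mixing" requirement is a small helper that reports which group labels appear on both sides of a split. This is only a minimal sketch: `shared_groups` and the toy labels are hypothetical names made up for illustration and are not part of sklearn.

```python
def shared_groups(groups, train_index, test_index):
    """Return the set of group labels that appear on both sides of a split."""
    train_groups = {groups[i] for i in train_index}
    test_groups = {groups[i] for i in test_index}
    return train_groups & test_groups


# Toy data: six entries from three subjects (labels are made up).
toy_groups = ["a", "a", "b", "b", "c", "c"]

# A split that keeps every subject on one side only.
clean = shared_groups(toy_groups, train_index=[0, 1, 2, 3], test_index=[4, 5])

# A split that leaks subject "b" into both sides.
leaky = shared_groups(toy_groups, train_index=[0, 1, 2], test_index=[3, 4, 5])

print(clean)  # set()
print(leaky)  # {'b'}
```

Inside the outer loop this could be used as `assert not shared_groups(groups, train_index, test_index)`. The same check should also apply to the inner folds by iterating over `inner_cv.split(x_train, y_train, groups=groups[train_index])` and passing `groups[train_index]` as the first argument, since the inner indices refer to rows of the training subset.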