
In the GroupKFold source, the random_state is set to None:

    def __init__(self, n_splits=3):
        super(GroupKFold, self).__init__(n_splits, shuffle=False,
                                         random_state=None)

Hence, when run multiple times (code from here):

import numpy as np
from sklearn.model_selection import GroupKFold

for i in range(0,10):
    X = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])
    y = np.array([1, 2, 3, 4])
    groups = np.array([0, 0, 2, 2])
    group_kfold = GroupKFold(n_splits=2)
    group_kfold.get_n_splits(X, y, groups)

    print(group_kfold)

    for train_index, test_index in group_kfold.split(X, y, groups):
        print("TRAIN:", train_index, "TEST:", test_index)
        X_train, X_test = X[train_index], X[test_index]
        y_train, y_test = y[train_index], y[test_index]
        print(X_train, X_test, y_train, y_test)
    print()
    print()

Output:

GroupKFold(n_splits=2)
TRAIN: [0 1] TEST: [2 3]
[[1 2]
 [3 4]] [[5 6]
 [7 8]] [1 2] [3 4]
TRAIN: [2 3] TEST: [0 1]
[[5 6]
 [7 8]] [[1 2]
 [3 4]] [3 4] [1 2]


GroupKFold(n_splits=2)
TRAIN: [0 1] TEST: [2 3]
[[1 2]
 [3 4]] [[5 6]
 [7 8]] [1 2] [3 4]
TRAIN: [2 3] TEST: [0 1]
[[5 6]
 [7 8]] [[1 2]
 [3 4]] [3 4] [1 2]

etc ...

The splits are identical.

How do I set a random_state for GroupKFold in order to get a different (but reproducible) set of splits over a few different trials of cross validation?

Eg, I want

GroupKFold(n_splits=2, random_state=42)
TRAIN: [0 1] TEST: [2 3]
TRAIN: [2 3] TEST: [0 1]


GroupKFold(n_splits=2, random_state=13)
TRAIN: [0 2] TEST: [1 3]
TRAIN: [1 3] TEST: [0 2]

So far, it seems a strategy might be to use sklearn.utils.shuffle first, as suggested in this post. However, this just rearranges the elements within each fold; it doesn't give us new splits.

from sklearn.utils import shuffle
from sklearn.model_selection import GroupKFold
import numpy as np
import sys

random_state = int(sys.argv[1])


X = np.arange(20).reshape((10, 2))
y = np.arange(10)
groups = np.array([0, 0, 0, 1, 2, 3, 4, 5, 6, 7])

def cv(X, y, groups, random_state):
    X_s, y_s, groups_s = shuffle(X, y, groups, random_state=random_state)
    cv_out = GroupKFold(n_splits=2)
    for train, test in cv_out.split(X_s, y_s, groups_s):
        print("---")
        print(X_s[test])
        print(y_s[test])
        print("test groups", groups_s[test])
        print("train groups", groups_s[train])

print("***")
cv(X, y, groups, random_state)

The output:

>python sshuf.py 32

***
---
[[ 2  3]
 [ 4  5]
 [ 0  1]
 [ 8  9]
 [12 13]]
[1 2 0 4 6]
test groups [0 0 0 2 4]
train groups [7 6 1 3 5]
---
[[18 19]
 [16 17]
 [ 6  7]
 [10 11]
 [14 15]]
[9 8 3 5 7]
test groups [7 6 1 3 5]
train groups [0 0 0 2 4]

>python sshuf.py 234

***
---
[[12 13]
 [ 4  5]
 [ 0  1]
 [ 2  3]
 [ 8  9]]
[6 2 0 1 4]
test groups [4 0 0 0 2]
train groups [7 3 1 5 6]
---
[[18 19]
 [10 11]
 [ 6  7]
 [14 15]
 [16 17]]
[9 5 3 7 8]
test groups [7 3 1 5 6]
train groups [4 0 0 0 2]
Sam Weisenthal
  • I think this is a bug. I opened a bug report. If I have time after work I may go fix it myself. https://github.com/scikit-learn/scikit-learn/issues/9323 – Him Jul 11 '17 at 14:37

7 Answers

  • KFold is only randomized if shuffle=True. Some datasets should not be shuffled.
  • GroupKFold is not randomized at all. Hence the random_state=None.
  • GroupShuffleSplit may be closer to what you're looking for (a minimal sketch follows this list).
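
For instance, a minimal sketch (on the asker's toy data) of how GroupShuffleSplit takes a random_state directly while still keeping every group wholly on one side of each split:

import numpy as np
from sklearn.model_selection import GroupShuffleSplit

X = np.arange(20).reshape((10, 2))
y = np.arange(10)
groups = np.array([0, 0, 0, 1, 2, 3, 4, 5, 6, 7])

# each train/test pair is drawn independently;
# random_state makes the whole sequence reproducible
gss = GroupShuffleSplit(n_splits=3, test_size=0.5, random_state=42)
for train_index, test_index in gss.split(X, y, groups):
    print("test groups:", np.unique(groups[test_index]))

Running it again with random_state=42 prints the same three test-group sets; changing the seed changes them.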

A comparison of the group-based splitters:

  • In GroupKFold, the test sets form a complete partition of all the data.
  • LeavePGroupsOut leaves all possible subsets of P groups out, combinatorially; test sets will overlap for P > 1. Since this means comb(n_groups, P) splits altogether, often you want a small P, and most often want LeaveOneGroupOut, which is basically the same as GroupKFold with n_splits equal to the number of groups (see the sketch after this list).
  • GroupShuffleSplit makes no statement about the relationship between successive test sets; each train/test split is performed independently.
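
A quick check of that LeaveOneGroupOut/GroupKFold relationship on the question's toy data (a sketch; the comparison uses sets of test indices because the folds may come out in a different order):

import numpy as np
from sklearn.model_selection import GroupKFold, LeaveOneGroupOut

X = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])
y = np.array([1, 2, 3, 4])
groups = np.array([0, 0, 2, 2])

# with one fold per group, both splitters hold out one whole group at a time
logo_tests = {tuple(test) for _, test in LeaveOneGroupOut().split(X, y, groups)}
gkf_tests = {tuple(test) for _, test in GroupKFold(n_splits=2).split(X, y, groups)}
print(logo_tests == gkf_tests)  # True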

As an aside, Dmytro Lituiev has proposed an alternative GroupShuffleSplit algorithm which is better at getting the right number of samples (not merely the right number of groups) in the test set for a specified test_size.

joeln
  • I see about `GroupKFold`; I misunderstood randomization in terms of `random_state`. I am confused about the difference between `GroupShuffleSplit` and `GroupKFold`. E.g. with 3 splits, `GroupKFold` produces 3 unique test sets. `GroupShuffleSplit`, however, might generate (with low probability) 3 test sets that are the same? – Sam Weisenthal Jul 15 '17 at 14:52
  • Does `GroupShuffleSplit` guarantee that the same groups are not represented in both the testing and training sets? It would be nice to be able to specify `replacement=False` – Sam Weisenthal Jul 15 '17 at 14:53
  • Actually for this matter what's the difference between `LeavePGroupsOut` and `GroupKFold`? – Sam Weisenthal Jul 15 '17 at 14:54
  • Thank you for your edits, but I still don't see how to use this to produce test sets that form a complete partition of the data while also taking some `random_state` so that I can run this multiple times without getting multiple identical cv results. The best option seems to be (suggested above) shuffling and then using GroupKFold, but I find this doesn't necessarily behave well when `GroupKFold` is wrapped inside a function. `GroupShuffleSplit` is not really k-fold cross validation and I am not sure whether it benefits from the same properties... – Sam Weisenthal Jul 17 '17 at 15:04
  • ...but I have read (e.g., Elements of Statistical Learning) that splitting the data into random train/test splits is *not* as good as cross validation. – Sam Weisenthal Jul 17 '17 at 15:06
  • I suppose I could find the number of groups *P* in each fold from *k* fold and then set *P* for `LeavePGroupsOut`, but I already have a trial I have run using GroupKFold, so I would like that to be my first experiment and now simply change the `random_state` to produce more iterations with different splits, but still using grouped cross validation as in my first trial. @joeln – Sam Weisenthal Jul 17 '17 at 15:08
  • I rescind the comment "this doesn't necessarily behave well I find when `GroupKFold` is wrapped inside a function." – Sam Weisenthal Jul 17 '17 at 15:42
  • If you want shuffled k-fold cross validation (e.g. repeated k-fold cross validation), it should be asymptotically identical to repeated Shuffle-Split. I don't have ELS lying around, but are you sure they're not talking about a *single* train-test split? – joeln Jul 18 '17 at 00:06
  • https://github.com/scikit-learn/scikit-learn/pull/5396 might offer you an implementation (albeit with a deprecated API) with the shuffling you seek. – joeln Jul 18 '17 at 00:06
  • @user0 Why did you accept the answer if you "still don't see how to use this to produce test sets that form a complete partition of the data while also taking some `random_state`"? As for myself, I will be using @xvr's solution. – Corey Levinson Dec 31 '22 at 16:57

Inspired by user0's answer (can't comment) but faster:

import numpy as np
import pandas as pd

def RandomGroupKFold_split(groups, n, seed=None):  # noqa: N802
    """
    Random analogue of sklearn.model_selection.GroupKFold.split.

    :return: list of (train, test) indices
    """
    groups = pd.Series(groups)
    ix = np.arange(len(groups))
    unique = np.unique(groups)
    np.random.RandomState(seed).shuffle(unique)
    result = []
    for split in np.array_split(unique, n):
        mask = groups.isin(split)
        train, test = ix[~mask], ix[mask]
        result.append((train, test))

    return result
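
A hypothetical usage example on the asker's toy data (not part of the answer itself): the same seed reproduces the same splits, while a different seed gives a different group-disjoint partition.

import numpy as np

groups = np.array([0, 0, 0, 1, 2, 3, 4, 5, 6, 7])
for train, test in RandomGroupKFold_split(groups, n=2, seed=42):
    print("TRAIN:", train, "TEST:", test)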
xrr

My solution so far has been to simply split the groups at random. This can lead to very unbalanced folds (which I think GroupKFold was designed to ward off), but the hope is that the number of observations per group is small.

from numpy.random import RandomState
import numpy as np
import sys

random_state = int(sys.argv[1])


X = np.arange(20).reshape((10, 2))
y = np.arange(10)
groups = np.array([0, 0, 0, 1, 2, 3, 4, 5, 6, 7])
for el in zip(range(len(y)), X, y, groups):
    print("ix, X, y, groups", el)

def RandGroupKfold(groups, n_splits, random_state=None, shuffle_groups=False):
    ix = np.arange(len(groups))
    unique_groups = np.unique(groups)
    if shuffle_groups:
        prng = RandomState(random_state)
        prng.shuffle(unique_groups)
    splits = np.array_split(unique_groups, n_splits)
    train_test_indices = []

    for split in splits:
        mask = np.array([el in split for el in groups])
        train = ix[~mask]
        test = ix[mask]
        train_test_indices.append((train, test))
    return train_test_indices

splits = RandGroupKfold(groups, n_splits=3, random_state=random_state, shuffle_groups=True)

for train, test in splits:
    print("---")
    for el in zip(train, X[train], y[train], groups[train]):
        print("train ix, X, y, groups", el)
    for el in zip(test, X[test], y[test], groups[test]):
        print("test ix, X, y, groups", el)

Data:

ix, X, y, groups (0, array([0, 1]), 0, 0)
ix, X, y, groups (1, array([2, 3]), 1, 0)
ix, X, y, groups (2, array([4, 5]), 2, 0)
ix, X, y, groups (3, array([6, 7]), 3, 1)
ix, X, y, groups (4, array([8, 9]), 4, 2)
ix, X, y, groups (5, array([10, 11]), 5, 3)
ix, X, y, groups (6, array([12, 13]), 6, 4)
ix, X, y, groups (7, array([14, 15]), 7, 5)
ix, X, y, groups (8, array([16, 17]), 8, 6)
ix, X, y, groups (9, array([18, 19]), 9, 7)

Random state as 4

---
train ix, X, y, groups (0, array([0, 1]), 0, 0)
train ix, X, y, groups (1, array([2, 3]), 1, 0)
train ix, X, y, groups (2, array([4, 5]), 2, 0)
train ix, X, y, groups (3, array([6, 7]), 3, 1)
train ix, X, y, groups (4, array([8, 9]), 4, 2)
train ix, X, y, groups (7, array([14, 15]), 7, 5)
train ix, X, y, groups (8, array([16, 17]), 8, 6)
test ix, X, y, groups (5, array([10, 11]), 5, 3)
test ix, X, y, groups (6, array([12, 13]), 6, 4)
test ix, X, y, groups (9, array([18, 19]), 9, 7)
---
train ix, X, y, groups (4, array([8, 9]), 4, 2)
train ix, X, y, groups (5, array([10, 11]), 5, 3)
train ix, X, y, groups (6, array([12, 13]), 6, 4)
train ix, X, y, groups (8, array([16, 17]), 8, 6)
train ix, X, y, groups (9, array([18, 19]), 9, 7)
test ix, X, y, groups (0, array([0, 1]), 0, 0)
test ix, X, y, groups (1, array([2, 3]), 1, 0)
test ix, X, y, groups (2, array([4, 5]), 2, 0)
test ix, X, y, groups (3, array([6, 7]), 3, 1)
test ix, X, y, groups (7, array([14, 15]), 7, 5)
---
train ix, X, y, groups (0, array([0, 1]), 0, 0)
train ix, X, y, groups (1, array([2, 3]), 1, 0)
train ix, X, y, groups (2, array([4, 5]), 2, 0)
train ix, X, y, groups (3, array([6, 7]), 3, 1)
train ix, X, y, groups (5, array([10, 11]), 5, 3)
train ix, X, y, groups (6, array([12, 13]), 6, 4)
train ix, X, y, groups (7, array([14, 15]), 7, 5)
train ix, X, y, groups (9, array([18, 19]), 9, 7)
test ix, X, y, groups (4, array([8, 9]), 4, 2)
test ix, X, y, groups (8, array([16, 17]), 8, 6)

Random state as 5

---
train ix, X, y, groups (0, array([0, 1]), 0, 0)
train ix, X, y, groups (1, array([2, 3]), 1, 0)
train ix, X, y, groups (2, array([4, 5]), 2, 0)
train ix, X, y, groups (3, array([6, 7]), 3, 1)
train ix, X, y, groups (5, array([10, 11]), 5, 3)
train ix, X, y, groups (7, array([14, 15]), 7, 5)
train ix, X, y, groups (8, array([16, 17]), 8, 6)
test ix, X, y, groups (4, array([8, 9]), 4, 2)
test ix, X, y, groups (6, array([12, 13]), 6, 4)
test ix, X, y, groups (9, array([18, 19]), 9, 7)
---
train ix, X, y, groups (4, array([8, 9]), 4, 2)
train ix, X, y, groups (5, array([10, 11]), 5, 3)
train ix, X, y, groups (6, array([12, 13]), 6, 4)
train ix, X, y, groups (8, array([16, 17]), 8, 6)
train ix, X, y, groups (9, array([18, 19]), 9, 7)
test ix, X, y, groups (0, array([0, 1]), 0, 0)
test ix, X, y, groups (1, array([2, 3]), 1, 0)
test ix, X, y, groups (2, array([4, 5]), 2, 0)
test ix, X, y, groups (3, array([6, 7]), 3, 1)
test ix, X, y, groups (7, array([14, 15]), 7, 5)
---
train ix, X, y, groups (0, array([0, 1]), 0, 0)
train ix, X, y, groups (1, array([2, 3]), 1, 0)
train ix, X, y, groups (2, array([4, 5]), 2, 0)
train ix, X, y, groups (3, array([6, 7]), 3, 1)
train ix, X, y, groups (4, array([8, 9]), 4, 2)
train ix, X, y, groups (6, array([12, 13]), 6, 4)
train ix, X, y, groups (7, array([14, 15]), 7, 5)
train ix, X, y, groups (9, array([18, 19]), 9, 7)
test ix, X, y, groups (5, array([10, 11]), 5, 3)
test ix, X, y, groups (8, array([16, 17]), 8, 6)
Sam Weisenthal

Subclass and implement

a random_state-dependent _iter_test_masks( ... random_state=None ) method, as self-documented in the scikit-learn base class's source. The random_state parameter passed at instantiation ( .__init__() ) is "just" stored, and it is left to the user's creativity whether it is used in some customised manner for test-mask generation (as literally expressed in the scikit-learn source comments):

(cit.:)

# Since subclasses must implement either _iter_test_masks or
# _iter_test_indices, neither can be abstract.

def _iter_test_masks(self, X=None, y=None, groups=None):
    """Generates boolean masks corresponding to test sets.

    By default, delegates to _iter_test_indices(X, y, groups)
    """
    for test_index in self._iter_test_indices(X, y, groups):
        test_mask = np.zeros(_num_samples(X), dtype=np.bool)
        test_mask[test_index] = True
        yield test_mask

A process that depends on an externally supplied random_state != None ought, as fair practice, also protect the RNG: save the actual current state of the RNG ( RNG_stateTUPLE = numpy.random.get_state() ), set the one provided via the .__init__() calling interface, and, after having finished, restore the RNG state from the saved one ( numpy.random.set_state( RNG_stateTUPLE ) ).

This way such a custom process gets both the required dependence on a random_state value and reproducibility. Q.E.D.
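
A minimal sketch of this idea (the class name and the fold-assignment policy are mine, not scikit-learn's: shuffled groups are simply dealt into folds, without GroupKFold's fold-size balancing):

import numpy as np
from sklearn.model_selection import GroupKFold

class RandomizedGroupKFold(GroupKFold):
    """GroupKFold-like splitter with a shuffled group-to-fold assignment."""

    def __init__(self, n_splits=3, random_state=None):
        super().__init__(n_splits=n_splits)
        self.random_state = random_state

    def _iter_test_indices(self, X=None, y=None, groups=None):
        # a private RandomState avoids touching numpy's global RNG state,
        # so no save/restore dance is needed
        rng = np.random.RandomState(self.random_state)
        unique_groups = np.unique(groups)
        rng.shuffle(unique_groups)
        for fold_groups in np.array_split(unique_groups, self.n_splits):
            yield np.flatnonzero(np.isin(groups, fold_groups))

Because the generator seeds its own RandomState from the stored random_state, repeated calls to split() with the same seed reproduce the same folds, which is exactly the reproducibility argued for above.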

user3666197

I wanted to combine group k-fold with having the same proportion of classes in the train and test sets. So I ran stratified k-fold over the groups, so that the same ratio of classes is maintained across folds, and then used the groups to place the samples in the folds. I also passed the random seed to the stratified splitter to solve the identical-splits issue.

import numpy as np
from sklearn.model_selection import StratifiedKFold

def Stratified_Group_KFold(Y, groups, n, seed=None):
    # groups is expected to be a list; list.index finds the first sample of each group
    unique = np.unique(groups)
    group_Y = []
    for group in unique:
        y = Y[groups.index(group)]
        group_Y.append(y)

    group_X = np.zeros_like(unique)
    skf_group = StratifiedKFold(n_splits=n, random_state=seed, shuffle=True)

    result = []
    for train_index, test_index in skf_group.split(group_X, group_Y):
        train_groups_in_fold = unique[train_index]
        test_groups_in_fold = unique[test_index]

        train = np.in1d(groups, train_groups_in_fold).nonzero()[0]
        test = np.in1d(groups, test_groups_in_fold).nonzero()[0]

        result.append((train, test))

    return result
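
A hypothetical usage sketch, assuming integer class labels in Y and groups passed as a plain list (the function relies on list.index):

import numpy as np

Y = np.array([0, 0, 1, 1, 0, 1, 0, 1, 0, 1])
groups = [0, 0, 0, 1, 2, 3, 4, 5, 6, 7]

for train, test in Stratified_Group_KFold(Y, groups, n=2, seed=13):
    print("test groups:", np.unique(np.asarray(groups)[test]))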
Sukrit Gupta

@user0

Eg, I want

    GroupKFold(n_splits=2, random_state=42)
    TRAIN: [0 1] TEST: [2 3]
    TRAIN: [2 3] TEST: [0 1]

    GroupKFold(n_splits=2, random_state=13)
    TRAIN: [0 2] TEST: [1 3]
    TRAIN: [1 3] TEST: [0 2]

The second split would put samples from one group into both the training and the test set, which is exactly what GroupKFold is supposed to avoid: group 0 (indices 0 and 1 in the dataset) would contribute index 0 to the training set and index 1 to the test set.

For the example you give, there isn't more than one way to do a grouped 2-fold split, since you only have 2 groups (a tiny sketch below illustrates this).
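
A grouped 2-fold split must send each whole group to exactly one fold, so with two groups there is only one partition (up to fold order); a tiny illustrative sketch:

from itertools import combinations

groups = [0, 2]  # the two groups in the question's example
# choose which groups form the first test fold; the rest form the second
splits = [(set(c), set(groups) - set(c)) for c in combinations(groups, 1)]
print(splits)  # [({0}, {2}), ({2}, {0})] -- one partition, folds swapped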

1west

GroupKFold appears deterministic based on the group labels. So the solution is to assign new labels. I approach this by shuffling the list of unique group identifiers and assigning new labels from 0 to n_groups - 1.

import numpy as np
from sklearn.model_selection import GroupKFold

def get_random_labels(labels, random_state):
    labels_shuffled = np.unique(labels)
    # shuffle works in place
    random_state.shuffle(labels_shuffled)
    new_labels_mapping = {k: i for i, k in enumerate(labels_shuffled)}
    new_labels = np.array([new_labels_mapping[label] for label in labels])
    reverse_dict = {v: k for k, v in new_labels_mapping.items()}
    return new_labels, reverse_dict

random_state = np.random.RandomState(41)
X = np.arange(20).reshape((10, 2))
y = np.arange(10)
groups = np.array([0, 0, 0, 1, 2, 3, 4, 5, 6, 7])

for _ in range(0, 5):
    group_kfold = GroupKFold(n_splits=2)
    new_labels, reverse_dict = get_random_labels(groups, random_state)
    
    print(group_kfold)

    for i, (train_index, test_index) in enumerate(group_kfold.split(X, y, new_labels)):
        X_train, X_test = X[train_index], X[test_index]
        y_train, y_test = y[train_index], y[test_index]
        groups_train, groups_test = groups[train_index], groups[test_index]
        print("Split no.", i + 1, "Training y:", y_train, "Testing y:", y_test)
    print()

Output:

GroupKFold(n_splits=2)
Split no. 1 Training y: [3 4 5 6 8] Testing y: [0 1 2 7 9]
Split no. 2 Training y: [0 1 2 7 9] Testing y: [3 4 5 6 8]

GroupKFold(n_splits=2)
Split no. 1 Training y: [3 4 7 8 9] Testing y: [0 1 2 5 6]
Split no. 2 Training y: [0 1 2 5 6] Testing y: [3 4 7 8 9]

GroupKFold(n_splits=2)
Split no. 1 Training y: [3 6 7 8 9] Testing y: [0 1 2 4 5]
Split no. 2 Training y: [0 1 2 4 5] Testing y: [3 6 7 8 9]

GroupKFold(n_splits=2)
Split no. 1 Training y: [5 6 7 8 9] Testing y: [0 1 2 3 4]
Split no. 2 Training y: [0 1 2 3 4] Testing y: [5 6 7 8 9]

GroupKFold(n_splits=2)
Split no. 1 Training y: [3 4 6 7 9] Testing y: [0 1 2 5 8]
Split no. 2 Training y: [0 1 2 5 8] Testing y: [3 4 6 7 9]

In the 10 samples, I made the first three belong to group 0, and each of the others belongs to its own unique group. The result is that the split is different each iteration.

The reverse_dict object is there to fetch the identities of the original labels.
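
For instance, a quick round-trip check (assuming the code above has just run):

original = np.array([reverse_dict[label] for label in new_labels])
assert (original == groups).all()  # new labels map back to the original groups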

bernie