Splitting a list of folds into training and validation sets

Question

I have created code that splits data into folds (7 in this case). In effect, I have a list of lists of 7 folds of data.

I now want to go through these and split into training and validation sets within each fold and store these as data frames.

As a newcomer, I have tried manual methods, GroupShuffleSplit, split() and so on, but I can't get the output I need. The methods are as follows:

def k_folds(data, k):
    """function that returns a list of k folds of the data"""
    
    ############################
    len_folds = find_fold_sizes(data, k)
    ############################

    folds = []
    for i in range(k):
        data_ss = data.sample(n=len_folds[i], random_state=20)
        data = data.drop(data_ss.index)
        folds.append(data_ss)

    return folds

(len_folds holds the calculated length of each fold: around 42 or 43 in this case, as I am using 300 rows of data.)
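
The helper find_fold_sizes is not shown in the question. A minimal sketch of what it might look like, assuming it only needs to spread the rows as evenly as possible across k folds (the body below is a guess for illustration, not the asker's code):

```python
def find_fold_sizes(data, k):
    """Hypothetical helper: return k fold sizes that sum to len(data)."""
    base, remainder = divmod(len(data), k)
    # The first `remainder` folds each take one extra row.
    return [base + 1 if i < remainder else base for i in range(k)]


# Anything with a length works here, including a DataFrame.
sizes = find_fold_sizes(range(300), 7)
print(sizes)  # [43, 43, 43, 43, 43, 43, 42]
```

With 300 rows and 7 folds this gives six folds of 43 rows and one of 42, which matches the "around 42 or 43" described above.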

This returns a list of 7 folds (0-6) in one big list.

I am then trying to use code such as

for i, fold in enumerate(folds):
        # Generate the training/testing visualizations for each CV split
    gss = GroupShuffleSplit(n_splits=7, test_size=0.3)
    train_dataset,test_dataset = next(gss.split(X=data, y=data['y'], groups=data.index.values))

to create training and validation sets for each fold. However, this gives me only one of the datasets as output, and if I try to output a single frame using train_dataset[1], for example, I just get a number, say 3.

I am an absolute beginner and out of my depth, so please accept my apologies if this is a stupid question, but any advice would be most welcome. Thank you in advance.

Maybe this is useful: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html — chc, Jul 17 '23 at 10:19
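
For reference, a minimal sketch of how KFold from the linked page could be used. Note that this does something slightly different from the question: it produces 7 train/validation pairs from the whole data set, with each fold serving as the validation set once, rather than splitting inside each fold. The toy DataFrame and its column names are assumptions standing in for the asker's data:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

# Toy stand-in for the 300-row frame; the column names are assumptions.
data = pd.DataFrame({"x": np.arange(300), "y": np.arange(300) % 2})

kf = KFold(n_splits=7, shuffle=True, random_state=20)

splits = []
for train_idx, val_idx in kf.split(data):
    # split() yields arrays of row positions, so .iloc turns them into DataFrames.
    splits.append((data.iloc[train_idx], data.iloc[val_idx]))

for i, (train_df, val_df) in enumerate(splits):
    print(i, len(train_df), len(val_df))
```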

score 0 · Answer 1 · answered Jul 17 '23 at 10:20

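In this line, train_dataset and test_dataset are simply reassigned to a new set of indices on every iteration of the for loop, so once the loop finishes you are left with only the result of the last iteration. Because split() yields arrays of row indices rather than data frames, train_dataset[1] is just a number.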

train_dataset,test_dataset = next(gss.split(X=data, y=data['y'], groups=data.index.values))

If you want to store the indices from each iteration, you can create two lists and append each new set of indices to them.

from sklearn.model_selection import GroupShuffleSplit

train_datasets, test_datasets = [], []

for i, fold in enumerate(folds):
    # Split the current fold; split() yields arrays of row positions, not DataFrames.
    # Only the first split is taken via next(), so n_splits=1 is enough.
    gss = GroupShuffleSplit(n_splits=1, test_size=0.3)
    train_dataset, test_dataset = next(gss.split(X=fold, y=fold['y'], groups=fold.index.values))
    train_datasets.append(train_dataset)
    test_datasets.append(test_dataset)
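
Since the question asks for data frames rather than indices, here is a minimal sketch of one way to get them, assuming each element of folds is a pandas DataFrame with a 'y' column. The toy data and the names train_frames and val_frames are made up for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Toy stand-in for the list returned by k_folds(); the columns are assumptions.
data = pd.DataFrame({"x": np.arange(300), "y": np.arange(300) % 2})
folds = [data.iloc[i::7] for i in range(7)]

train_frames, val_frames = [], []
for fold in folds:
    gss = GroupShuffleSplit(n_splits=1, test_size=0.3, random_state=20)
    # The index arrays are positions within this fold, not index labels.
    train_idx, val_idx = next(
        gss.split(X=fold, y=fold["y"], groups=fold.index.values)
    )
    # .iloc selects rows by position, giving one DataFrame per split.
    train_frames.append(fold.iloc[train_idx])
    val_frames.append(fold.iloc[val_idx])

print(len(train_frames), len(val_frames))
```

Because every row is its own group here (groups is the row index), the grouping has no real effect, and a plain ShuffleSplit or train_test_split on each fold would behave much the same way.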

Maria, thank you very much, this is appreciated. I am getting an error: 'builtin_function_or_method' object has no attribute 'values'. Is there another way around this? I am sorry to come back with a follow-up, but I am getting very close! Thank you — DWS, Jul 17 '23 at 12:43
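
One plausible cause of this error, assuming data was bound to a plain Python list (for example the list of folds) rather than a DataFrame: a list has an index method, not a pandas Index, so data.index.values fails. The placeholder list below is made up to reproduce the message:

```python
# Placeholder list standing in for the list of folds (an assumption).
folds = [[1, 2, 3], [4, 5, 6]]

try:
    # list.index is a built-in method, so it has no .values attribute.
    folds.index.values
except AttributeError as exc:
    message = str(exc)
    print(message)
```

If that is what happened, passing a single fold (a DataFrame) to split() instead of the whole list should avoid the error.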
