I have created code that splits data into folds (7 in this case). In effect, I have a list of lists of 7 folds of data.

I now want to go through these and split into training and validation sets within each fold and store these as data frames.

As a newcomer, I have tried manual methods, GroupShuffleSplit, split() and so on, but I can't get the output I need. The methods are as follows:

def k_folds(data, k):
    """function that returns a list of k folds of the data"""
    
    ############################
    len_folds = find_fold_sizes(data, k)
    ############################

    folds = []
    for i in range(k):
        data_ss = data.sample(n=len_folds[i], random_state=20)
        data = data.drop(data_ss.index)
        folds.append(data_ss)

    return folds 

(len_folds holds the calculated length of each fold: in this case 42 or 43, since I am using 300 rows of data.)
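
For reference, here is a minimal sketch of what a helper like find_fold_sizes might look like. The original helper is not shown in the question, so this version is an assumption: it simply spreads the remainder of the division over the first folds, which gives sizes of 42 or 43 for 300 rows and k=7.

```python
def find_fold_sizes(data, k):
    """Hypothetical stand-in for the helper that is not shown in the question."""
    base, remainder = divmod(len(data), k)
    # The first `remainder` folds get one extra row each.
    return [base + 1 if i < remainder else base for i in range(k)]


# Anything with a length works here; a DataFrame of 300 rows behaves the same.
sizes = find_fold_sizes(range(300), 7)
print(sizes)  # [43, 43, 43, 43, 43, 43, 42]
```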

This returns a list of 7 folds (0-6) in one big list.

I am then trying to use code such as

for i, fold in enumerate(folds):
    # Generate the training/testing visualizations for each CV split
    gss = GroupShuffleSplit(n_splits=7, test_size=0.3)
    train_dataset,test_dataset = next(gss.split(X=data, y=data['y'], groups=data.index.values))

to create training and validation sets for each fold. This, however, gives me only one of the datasets as output, and if I try to output a single frame using train_dataset[1], for example, I just get a number, say 3.

I am an absolute beginner and out of my depth, so please accept my apologies if this is a stupid question, but any advice would be most welcome. Thank you in advance.

Maria K
DWS
  • Maybe this is useful: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html – chc Jul 17 '23 at 10:19
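
A rough sketch of how the KFold class linked above could produce seven train/validation pairs as data frames in one pass. The toy DataFrame and its column names are assumptions standing in for the 300-row data in the question.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

# Toy stand-in for the question's data (assumed columns "x" and "y").
data = pd.DataFrame({"x": np.arange(300), "y": np.arange(300) % 2})

kf = KFold(n_splits=7, shuffle=True, random_state=20)
splits = []
for train_pos, val_pos in kf.split(data):
    # Each iteration holds out one fold for validation and trains on the other six.
    # split() yields row positions, so .iloc converts them back into data frames.
    splits.append((data.iloc[train_pos], data.iloc[val_pos]))

print([len(val) for _, val in splits])
```

Note that this is the usual cross-validation layout (one fold held out, the rest used for training), which is not quite the same as splitting each fold internally.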

1 Answer

In this line, train_dataset and test_dataset are simply reassigned to a new set of indices on every iteration of the for loop, so after the loop you are left with only the result of the last iteration.

train_dataset,test_dataset = next(gss.split(X=data, y=data['y'], groups=data.index.values))

If you want to store indices for each iteration, you can create 2 lists and append new sets of indices to them.

from sklearn.model_selection import GroupShuffleSplit

train_datasets, test_datasets = [], []

for i, fold in enumerate(folds):
    # Draw one train/test split per iteration; split() yields arrays of row positions
    gss = GroupShuffleSplit(n_splits=7, test_size=0.3)
    train_dataset, test_dataset = next(gss.split(X=data, y=data['y'], groups=data.index.values))
    # Keep each iteration's indices instead of overwriting them
    train_datasets.append(train_dataset)
    test_datasets.append(test_dataset)
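
If the goal is data frames rather than index arrays, and a split inside each fold rather than of the whole data set, something along these lines may be closer to what the question asks for. This is only a sketch: the toy data, the way the folds are built, and the names train_frames and val_frames are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Toy data standing in for the question's 300-row DataFrame (assumed columns).
rng = np.random.default_rng(0)
data = pd.DataFrame({"x": rng.normal(size=300), "y": rng.integers(0, 2, size=300)})

# Seven disjoint folds of 42 or 43 rows, built by shuffling row positions.
folds = [data.iloc[idx] for idx in np.array_split(rng.permutation(len(data)), 7)]

train_frames, val_frames = [], []
for fold in folds:
    # One 70/30 split of this fold; every row is its own group, as in the question.
    gss = GroupShuffleSplit(n_splits=1, test_size=0.3, random_state=20)
    train_pos, val_pos = next(gss.split(fold, fold["y"], groups=fold.index.values))
    # split() yields integer positions, so .iloc turns them back into rows.
    train_frames.append(fold.iloc[train_pos])
    val_frames.append(fold.iloc[val_pos])

print(train_frames[0].shape, val_frames[0].shape)
```

Because split() returns positions and not rows, indexing its result directly (for example train_dataset[1]) gives a single integer, which would explain the lone number seen in the question.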
Maria K
  • Maria, thank you very much, this is appreciated. I am getting an error: 'builtin_function_or_method' object has no attribute 'values'. Is there another way around this? I am sorry to come back with a follow-up, but I am getting very close! Thank you – DWS Jul 17 '23 at 12:43
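
One likely cause of that error, offered as an assumption since the calling code is not shown: at that point data is a plain Python list (for example the list of folds) and not a DataFrame. A list's .index is a built-in method, so .index.values fails. The small sketch below reproduces the message with made-up data.

```python
import pandas as pd

frame = pd.DataFrame({"y": [0, 1, 0, 1]})
folds = [frame.iloc[:2], frame.iloc[2:]]  # a plain list of DataFrames

try:
    folds.index.values  # list.index is a method, not a pandas Index
except AttributeError as err:
    message = str(err)
    print(message)

# Indexing into the list first gives a DataFrame, which does have .index.values.
first_fold_index = folds[0].index.values
print(first_fold_index)
```

If that is the situation, passing a single fold (or the original DataFrame) to split() instead of the list should avoid the error.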