I have created code that splits data into folds (7 in this case). In effect, I have a list of 7 folds of data, each one a data frame.
I now want to go through these and split into training and validation sets within each fold and store these as data frames.
As a newcomer, I have tried manual methods, GroupShuffleSplit.split() and so on, but I can't get the output I need. The methods are as follows:
def k_folds(data, k):
    """function that returns a list of k folds of the data"""
    ############################
    len_folds = find_fold_sizes(data, k)
    ############################
    folds = []
    for i in range(k):
        data_ss = data.sample(n=len_folds[i], random_state=20)
        data = data.drop(data_ss.index)
        folds.append(data_ss)
    return folds
(len_folds is the calculated length of each fold: 42 or 43 rows in this case, as I am using 300 rows of data.)
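To make the 42-or-43 arithmetic concrete, here is a minimal sketch of how such fold sizes can be computed. The name `fold_sizes` and its signature are hypothetical stand-ins; the actual `find_fold_sizes` helper is not shown in this post and may work differently.

```python
def fold_sizes(n_rows, k):
    """Hypothetical stand-in for find_fold_sizes: sizes of k near-equal folds."""
    base, extra = divmod(n_rows, k)
    # The first `extra` folds take one additional row each,
    # so the sizes always add up to n_rows.
    return [base + 1 if i < extra else base for i in range(k)]


sizes = fold_sizes(300, 7)
print(sizes)  # [43, 43, 43, 43, 43, 43, 42]
```

With 300 rows and 7 folds, 300 = 7 * 42 + 6, so six folds get 43 rows and one gets 42.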
This returns a list of 7 folds (0-6) in one big list.
I am then trying to use code such as
for i, fold in enumerate(folds):
    # Generate the training/testing visualizations for each CV split
    gss = GroupShuffleSplit(n_splits=7, test_size=0.3)
    train_dataset, test_dataset = next(gss.split(X=data, y=data['y'], groups=data.index.values))
to create training and validation sets for each fold. However, this gives me output for only one of the datasets, and if I try to output a single frame using, for example, train_dataset[1], I just get a number, say 3.
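For context, `GroupShuffleSplit.split()` yields arrays of integer row positions rather than data frames, so indexing one of them returns a single position. The following is a minimal sketch of the intended result, not a definitive implementation. It assumes that each fold is a pandas data frame with a `y` column, that the split should run on `fold` rather than on the full `data`, and that one split per fold (`n_splits=1`) is enough. The stand-in data, the `splits` name, and the fixed `random_state` are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical stand-in data: 300 rows with one feature column and a 'y' column.
rng = np.random.default_rng(0)
data = pd.DataFrame({"x": rng.normal(size=300), "y": rng.integers(0, 2, size=300)})

# Stand-in for the output of k_folds(data, 7): 7 disjoint folds of 42 or 43 rows.
shuffled = data.sample(frac=1, random_state=20)
sizes = [43, 43, 43, 43, 43, 43, 42]
folds = []
start = 0
for size in sizes:
    folds.append(shuffled.iloc[start:start + size])
    start += size

splits = []  # one (train_df, val_df) pair per fold
for fold in folds:
    gss = GroupShuffleSplit(n_splits=1, test_size=0.3, random_state=20)
    # split() yields arrays of integer row positions within `fold`.
    # Using the index as groups puts every row in its own group.
    train_pos, val_pos = next(
        gss.split(X=fold, y=fold["y"], groups=fold.index.values)
    )
    # .iloc turns those positions back into data frames.
    splits.append((fold.iloc[train_pos], fold.iloc[val_pos]))

for i, (train_df, val_df) in enumerate(splits):
    print(i, train_df.shape, val_df.shape)
```

After this runs, `splits[i][0]` and `splits[i][1]` are the training and validation data frames for fold `i`.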
I am an absolute beginner and out of my depth, so please accept my apologies if this is a stupid question, but any advice would be most welcome. Thank you in advance.