
I have written a simple K-fold cross-validation split, and now I want to modify it so that the folds are balanced in both size and class distribution.

P.S.: I need to write the Python code from scratch; sklearn is not allowed.

from random import seed
from random import randrange


def cross_validation_split(dataset, folds=3):
    dataset_split = []
    dataset_copy = list(dataset)
    fold_size = int(len(dataset) / folds)  # samples left over after an uneven split are dropped
    for i in range(folds):
        fold = []
        while len(fold) < fold_size:
            # move a randomly chosen remaining sample into the current fold
            index = randrange(len(dataset_copy))
            fold.append(dataset_copy.pop(index))
        dataset_split.append(fold)
    return dataset_split


seed()
dataset = [1,1,1,2,2,2,3,3,4,4,4]
folds = cross_validation_split(dataset, 2)
print(folds)

I get this as a result: [[4, 4, 3, 3, 4], [1, 1, 2, 2, 1]].

I want the output to be, for example, [[1,3,2,4,4],[1,2,2,4,3]].
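
One way to sketch this from scratch (a sketch only; the helper name stratified_cv_split is just illustrative): group the samples by class label, shuffle within each class, and then deal the items round-robin across the folds. In this toy dataset each value doubles as its own class label.

from collections import defaultdict
from random import shuffle


def stratified_cv_split(dataset, folds=3):
    # group samples by class label; here each value acts as its own label
    by_class = defaultdict(list)
    for value in dataset:
        by_class[value].append(value)

    fold_list = [[] for _ in range(folds)]
    position = 0
    for label in sorted(by_class):
        items = by_class[label]
        shuffle(items)  # with real (features, label) pairs this randomizes which samples land where
        for item in items:
            # deal round-robin; the counter carries over between classes,
            # so each fold's count for a given class differs by at most one
            fold_list[position % folds].append(item)
            position += 1
    return fold_list


dataset = [1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 4]
print(stratified_cv_split(dataset, 2))
# e.g. [[1, 1, 2, 3, 4, 4], [1, 2, 2, 3, 4]]

Because the position counter carries over from one class to the next, the overall fold sizes also end up within one sample of each other, not just the per-class counts.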

  • Why not just shuffle the list before the cross-validation? If your data is well distributed it should work well enough (a sketch of this appears after the comments). – Rotem Tal Jun 27 '19 at 17:15
  • As Rotem Tal said, if your dataset is already balanced, then random cross-validation splits will be sufficiently balanced as well. That may not be evident with only a few data points as you used in the example. – Abhineet Gupta Jun 27 '19 at 17:19
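
A minimal sketch of the commenters' suggestion (shuffle first, then cut into contiguous folds; shuffled_split is a hypothetical name). This keeps the fold sizes equal but only balances the classes approximately, which works best for larger, well-mixed datasets:

from random import shuffle


def shuffled_split(dataset, folds=3):
    data = list(dataset)
    shuffle(data)  # randomize order in place
    fold_size = len(data) // folds  # leftover samples are dropped, as in the question
    return [data[i * fold_size:(i + 1) * fold_size] for i in range(folds)]


print(shuffled_split([1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 4], 2))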

0 Answers