
I'm trying to train/validate a CNN using PyTorch on an unbalanced image dataset (class 1: 250 images, class 0: ~4000 images), and so far I've applied augmentation only to my training set (thanks @jodag). However, my model is still learning to favor the class with significantly more images.

I want to find ways to compensate for my unbalanced data set.

I thought about oversampling/undersampling with the imbalanced dataset sampler (https://github.com/ufoym/imbalanced-dataset-sampler), but I already use a sampler to select indices for my 5-fold cross-validation. Is there a way I could implement cross-validation using the code below and also add this sampler? Similarly, is there a way to augment one label more frequently than the other? Along the same lines, are there any simpler alternatives for addressing my unbalanced dataset that I haven't looked into yet?

Here's an example of what I have so far:

import torch
from torch.utils.data import SubsetRandomSampler
from torchvision import datasets
from sklearn.model_selection import KFold
from torchsampler import ImbalancedDatasetSampler  # github.com/ufoym/imbalanced-dataset-sampler

total_set = datasets.ImageFolder(PATH)
KF_splits = KFold(n_splits=5, shuffle=True, random_state=42)

for i, (train_idx, valid_idx) in enumerate(KF_splits.split(total_set)):
    #samplers to restrict each loader to the indices of the current fold
    train_sampler = SubsetRandomSampler(train_idx)
    valid_sampler = SubsetRandomSampler(valid_idx)

    #Use a wrapper to apply augmentation only to the training set
    #These are dataloaders that pull images from the same folder but sort into validation and training sets
    #Though transforms augment only the training set, they don't address
    #the underlying issue of a heavily unbalanced dataset
    #NOTE: ImbalancedDatasetSampler draws from the whole dataset here, which
    #conflicts with the fold indices above; this is the part I can't reconcile

    train_loader = torch.utils.data.DataLoader(
        WrapperDataset(total_set, transform=data_transforms['train']),
        batch_size=32, sampler=ImbalancedDatasetSampler(total_set))
    valid_loader = torch.utils.data.DataLoader(
        WrapperDataset(total_set, transform=data_transforms['val']),
        batch_size=32, sampler=valid_sampler)

    print("Fold: " + str(i))

    for epoch in range(epochs):
        #Train/validate model below
        ...

Thank you for your time and help!

  • FWIW, anytime you do data augmentation, you should probably do that outside/prior to splitting into train/test or CV folds. Otherwise you risk inserting bias into your models towards the over- or under-sampled data – G. Anderson Aug 21 '19 at 17:13
  • I previously augmented my entire dataset (manually augmented it offline to balance the dataset), but wouldn't that introduce bias to the validation set when I split it into training/validation sets? – jinsom Aug 21 '19 at 17:15
  • General wisdom (as far as I've always been taught) is to use the same techniques in train and test splits, then have a truly untouched validation set without any augmentation to test the model against – G. Anderson Aug 21 '19 at 17:19
  • Since it's about technique more than coding, you might also try asking this on [stats.se] or [datascience.se] for more breadth of experience – G. Anderson Aug 21 '19 at 17:20
  • Thanks for the nugget. Just FYI, I think this is where I got my information from. It looks like they recommend against augmenting the entire set first. https://stats.stackexchange.com/questions/175504/how-to-do-data-augmentation-and-train-validate-split – jinsom Aug 21 '19 at 17:29
  • Thanks for the link, that's fair as well. You could always try it both ways on a subset of your data and see what works best, as long as you have a true hold-out validation set. – G. Anderson Aug 21 '19 at 17:36
  • Addressing your problem, I believe `WeightedRandomSampler` could be used in place of the `SubsetRandomSampler` to achieve a balanced sampling. You would need to set the weight of indices not in the split to zero and the others could be set to enforce equal probability that each class will be sampled. – jodag Aug 21 '19 at 17:37
  • Sorry, is it okay if I try and paraphrase what you said? It seems like you can use WeightedRandomSampler to set weights based on class for indices of images. I have a function that generates weights for all the images in my dataset. Then, by setting the weights of the indices that aren't in the split of interest to zero, I can essentially implement K-fold cross-validation that way? – jinsom Aug 21 '19 at 17:51
  • @jinsom Kind of. Really you would want to recompute the weights for each training split since the ratio of classes may change after a random subset of indices are removed. E.g. consider you have 3 samples of each class in total_set, so the weights of every sample would be equal for uniform sampling. Then you randomly select a subset and some classes have 2 and some have 1 sample. The weights would need to be recomputed to account for the change in number of samples per class, then set all the entries not in the train split to 0 to ensure they can never be sampled. – jodag Aug 21 '19 at 19:31
  • Would it be possible to do stratified random sampling with the weighted sampler then? – jinsom Aug 21 '19 at 20:21
  • The weighted random sampler allows you to define a different probability of sampling each individual element of your dataset (note that `len(total_set)` is the same as `len(weights)`). So you can use it to make the sampled distribution anything that you want if you provide the proper weights. – jodag Aug 22 '19 at 03:06 (see the sketch below)
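
For reference, here is a rough, untested sketch of the `WeightedRandomSampler` idea described in the comments above. It assumes `total_set.targets` holds the class index of every sample (as `datasets.ImageFolder` provides) and reuses the `WrapperDataset` and `data_transforms` names from the question. For each fold, per-sample weights are recomputed from that fold's training indices only and set to zero everywhere else, so out-of-fold images can never be drawn.

```python
import numpy as np
import torch
from torch.utils.data import DataLoader, SubsetRandomSampler, WeightedRandomSampler

# Class label of every sample in the ImageFolder dataset
targets = np.array(total_set.targets)

for i, (train_idx, valid_idx) in enumerate(KF_splits.split(total_set)):
    # Recompute class counts from this fold's training indices only,
    # since the class ratio changes once a random subset is held out
    class_counts = np.bincount(targets[train_idx])

    # Weight each training sample inversely to its class frequency in this
    # fold, and give every out-of-fold index weight 0 so it is never sampled
    weights = np.zeros(len(total_set), dtype=np.float64)
    weights[train_idx] = 1.0 / class_counts[targets[train_idx]]

    train_sampler = WeightedRandomSampler(
        weights=torch.from_numpy(weights),
        num_samples=len(train_idx),
        replacement=True)

    train_loader = DataLoader(
        WrapperDataset(total_set, transform=data_transforms['train']),
        batch_size=32, sampler=train_sampler)
    valid_loader = DataLoader(
        WrapperDataset(total_set, transform=data_transforms['val']),
        batch_size=32, sampler=SubsetRandomSampler(valid_idx))
```

With `replacement=True`, minority-class images are simply drawn more than once per epoch, so training batches come out roughly class-balanced while the validation loader still sees the untouched class distribution of its fold.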

0 Answers