
I was training a custom model in PyTorch and the dataset was very uneven: there are 10 classes, where some classes have only 800 images while others have 4000. I found that image augmentation was a solution to my overfitting problem, but I got confused while implementing it. The code below was used to alter the features of the images:

from torchvision import transforms

loader_transform = transforms.Compose([
    transforms.RandomRotation(30),
    transforms.RandomResizedCrop(140),
    transforms.RandomHorizontalFlip()
])

But while training, it still shows the original number of images. Where did the newly created augmented images go? And if I want to save them to my local machine and make all classes even, what can be done?

2 Answers


It looks like you are using online augmentations. If you would like to use offline augmentations instead, add a pre-processing step that saves the augmented images to disk, then use the saved images in the training step.

Please make sure you understand the difference between online and offline augmentation:

Offline or pre-processing augmentation

Augmentation is applied as a pre-processing step to increase the size of the dataset. This is usually done when we want to expand a small training dataset. When applying it to larger datasets, we have to consider the disk space required.
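As a minimal sketch of the offline approach (the folder paths and target count below are placeholders, and it assumes a folder-per-class layout like ImageFolder expects), you could save random augmented copies of each class until the classes are even:

import os
from PIL import Image
from torchvision import transforms

loader_transform = transforms.Compose([
    transforms.RandomRotation(30),
    transforms.RandomResizedCrop(140),
    transforms.RandomHorizontalFlip()
])

src_root = '/path/to/images'      # hypothetical source folder, one subfolder per class
dst_root = '/path/to/augmented'   # hypothetical output folder
target_per_class = 4000           # e.g. match the largest class

for cls in os.listdir(src_root):
    files = os.listdir(os.path.join(src_root, cls))  # assumes only image files inside
    os.makedirs(os.path.join(dst_root, cls), exist_ok=True)
    # Cycle over the originals, saving random variants until the class is full.
    for i in range(target_per_class):
        name = files[i % len(files)]
        img = Image.open(os.path.join(src_root, cls, name)).convert('RGB')
        aug = loader_transform(img)  # without ToTensor(), this returns a PIL image
        aug.save(os.path.join(dst_root, cls, f'{i}.jpg'))

Training would then point at dst_root with no random transforms, since the variation is already baked into the saved files.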

Online or real-time augmentation

The augmentation is applied on the fly through random transformations. Since the augmented images do not need to be saved to disk, this method is usually applied to large datasets. With online augmentation, the model sees a different variant of each image at every epoch.
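For comparison, a minimal online setup just passes the transform to the dataset; nothing is written to disk, and the random transforms re-run every time an image is loaded (the path below is a placeholder):

from torchvision import datasets, transforms
from torch.utils.data import DataLoader

online_transform = transforms.Compose([
    transforms.RandomRotation(30),
    transforms.RandomResizedCrop(140),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor()  # DataLoader batches need tensors
])

# Augmentation is applied lazily in __getitem__; the dataset size is unchanged.
data = datasets.ImageFolder('/path/to/images', transform=online_transform)
loader = DataLoader(data, batch_size=32, shuffle=True)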

Ophir Yaniv
  • Ohhh, yeah. People generally say to apply image augmentation techniques, but this seems like the basic foundation I skipped over. Thanks for the answer, it's insightful. – Manjunath D Jun 16 '22 at 19:04

Hard to tell without seeing your dataset/dataloader, but I suspect you're simply applying transformations to your dataset. This won't change the dataset size; it just augments the existing images. If you wish to balance classes, adding a sampler seems the easiest solution.

Here's (somewhat simplified) code I use for this purpose, utilizing pandas, collections and torch.utils.data.WeightedRandomSampler. Likely not the best out there but it gets the job done:

import torch
import pandas as pd
from collections import Counter
from torchvision import datasets
from torch.utils.data import DataLoader, WeightedRandomSampler, random_split

# Note: the transformations should include ToTensor() in this case.
data = datasets.ImageFolder('/path/to/images', transform=loader_transform)

# Split into train/test sets:
train_len = int(len(data)*0.8)
train_set, test_set = random_split(data, [train_len, len(data) - train_len])

# Extract classes:
train_classes = [train_set.dataset.targets[i] for i in train_set.indices]
# Calculate support (number of images per class):
class_count = Counter(train_classes)
# Calculate inverse-frequency class weights, ordered by class index:
class_weights = torch.DoubleTensor([len(train_classes)/c for c in pd.Series(class_count).sort_index().values])
# Sampler needs the respective class weight supplied for each image in the dataset:
sample_weights = [class_weights[train_set.dataset.targets[i]] for i in train_set.indices]

sampler = WeightedRandomSampler(weights=sample_weights, num_samples=int(len(train_set)*2), replacement=True)

# Create torch dataloaders:
batch_size = 4
train_loader = DataLoader(train_set, batch_size=batch_size, sampler=sampler, num_workers=12)
print("The number of images in a training set is:", len(train_loader)*batch_size)

test_loader = DataLoader(test_set, batch_size=batch_size, shuffle=False, num_workers=12)
print("The number of images in a test set is:", len(test_loader)*batch_size)

The final train size will be 2x the original in this case, but you may experiment with smaller sizes too; the class representation will be balanced regardless of the size chosen.
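As a quick sanity check (a small addition, not part of the code above, reusing the same train_loader), you can count the labels drawn in one epoch; the counts should come out roughly equal across classes:

from collections import Counter

# Tally how often each class is sampled in one pass over the balanced loader.
seen = Counter()
for _, labels in train_loader:
    seen.update(labels.tolist())
print(seen)  # counts should be roughly equal across the 10 classes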

dx2-66
  • Why do we need to calculate the class weights above? I've generally never used the sampler parameter when creating dataloaders. If you could explain how using a sampler helps, I'd be glad. Thanks for the response anyways. – Manjunath D Jun 16 '22 at 19:02
  • You're dealing with imbalanced classes. It's a commonplace observation that if there are two classes with 4000 and 800 samples respectively, a model that always predicts the majority class would yield 83% accuracy while being entirely useless. Ideally, we need to make sure each class provides a similar number of samples during training (or assign weights to the samples), and/or adjust the classification threshold. – dx2-66 Jun 17 '22 at 07:58
  • Without augmentations, the sampler would repeat the existing images or drop excess images to achieve the class balance, according to the num_samples of your choice. Transformations without a sampler will just randomly augment every original image once per epoch. Sampler + transformations will also randomly augment the image each time it is sampled, ensuring the samples of the same image are slightly different. – dx2-66 Jun 17 '22 at 07:58