Creating a train, test split for data nested in multiple folders

Question

I am preparing my data for training an image recognition model. I currently have one folder (the dataset) that contains multiple folders with the names of the labels and these folders have the images inside them.

I want to split this dataset so that I have two main folders with the same subfolders, with the number of images in each following a preferred train/test split, for instance 90% of the images in the train dataset and 10% in the test dataset.

I am struggling to find the best way to split my data. I have read a suggestion that PyTorch's torch.utils.data.Dataset class might be a way to do it, but I can't get it working in a way that preserves the folder hierarchy.

Nikaido · Accepted Answer · 2020-11-09T20:46:48.247

If you have a folder structure like this:

folder
│     
│
└───class1
│   │   file011
│   │   file012
│   
└───class2
    │   file021
    │   file022

You can simply use the class torchvision.datasets.ImageFolder.

As stated in the PyTorch documentation:

A generic data loader where the images are arranged in this way:
root/dog/xxx.png
root/dog/xxy.png
root/dog/xxz.png

root/cat/123.png
root/cat/nsdf3.png
root/cat/asd932_.png
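
To make the layout concrete, here is a minimal standard-library sketch of how a directory tree like this can be read into (path, class index) pairs. It only illustrates the idea and is not torchvision's implementation. The function name list_samples is hypothetical, and the sketch assumes class indexes are assigned by sorting the class folder names alphabetically.

```python
import os

def list_samples(root):
    """Return (samples, class_to_idx) for a root/<class>/<file> layout."""
    # Each immediate subdirectory of root is treated as one class.
    classes = sorted(
        d for d in os.listdir(root) if os.path.isdir(os.path.join(root, d))
    )
    class_to_idx = {name: i for i, name in enumerate(classes)}
    samples = []
    for name in classes:
        class_dir = os.path.join(root, name)
        for fname in sorted(os.listdir(class_dir)):
            path = os.path.join(class_dir, fname)
            if os.path.isfile(path):
                samples.append((path, class_to_idx[name]))
    return samples, class_to_idx
```

Note that the samples come out grouped by class, one folder after another, which matters as soon as you split the list by position.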

Then, after you have created your ImageFolder instance, for example:

dataset = torchvision.datasets.ImageFolder(YOUR_PATH, ...)

you can split it in this way:

test_size = int(0.1 * len(dataset))  # range() needs an integer
test_set = torch.utils.data.Subset(dataset, range(test_size))  # take the first 10% for test
train_set = torch.utils.data.Subset(dataset, range(test_size, len(dataset)))  # the remaining part for train

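If you want to shuffle before splitting, remember that the Subset class selects samples by index, so you can shuffle the indexes and then split them. This matters in practice: ImageFolder lists its samples class by class, so an unshuffled split puts only the first classes in the test set. You could do something like this (using random.shuffle from the standard library, which shuffles in place) and then pass indexes_train and indexes_test to Subset:

import random
indexes = list(range(len(dataset)))
random.shuffle(indexes)  # shuffles the list in place and returns None
indexes_train = indexes[:int(len(dataset) * 0.9)]
indexes_test = indexes[int(len(dataset) * 0.9):]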
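
If the classes differ in size, a plain shuffle can leave a small class under-represented in the test set. As a rough sketch of one alternative, the indexes can be shuffled and split per class instead. The helper name stratified_split and its parameters are hypothetical. It takes a list with one class label per sample, which is what ImageFolder exposes as dataset.targets; a toy list stands in for it here so the example runs without any images.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_fraction=0.1, seed=None):
    """Split sample indexes per class; return (train_indexes, test_indexes)."""
    rng = random.Random(seed)  # seed=None gives a different split on each run
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    train_indexes, test_indexes = [], []
    for indexes in by_class.values():
        rng.shuffle(indexes)  # shuffle within the class only
        n_test = int(len(indexes) * test_fraction)
        test_indexes.extend(indexes[:n_test])
        train_indexes.extend(indexes[n_test:])
    return train_indexes, test_indexes

# Toy labels standing in for dataset.targets: 20 samples of class 0, 10 of class 1.
labels = [0] * 20 + [1] * 10
train_indexes, test_indexes = stratified_split(labels, test_fraction=0.1, seed=0)
print(len(train_indexes), len(test_indexes))
```

With the ImageFolder from above, this would be called as stratified_split(dataset.targets) and the two lists passed to torch.utils.data.Subset. Leaving seed unset gives a new random split for every training session, while a fixed seed makes the split reproducible.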

Thank you. Do you know if this class has a functionality to split the data randomly each time? This method will likely introduce bias in multiple training sessions. — smejak, Nov 09 '20 at 20:43
I wonder, though, whether this method will lead to problems down the line, because the type of my initial dataset is torchvision.datasets.folder.ImageFolder, whereas the subset is torch.utils.data.dataset.Subset. Can I still apply transforms on the Subset and then feed it directly to the NN? — smejak, Nov 10 '20 at 11:44
@smejak I don't think there would be problems. It is done the same way in this answer: https://stackoverflow.com/questions/57246630/how-to-split-data-into-train-and-test-sets-using-torchvision-datasets-imagefolde — Nikaido, Nov 10 '20 at 13:23
