For my first PyTorch project, I have to perform image classification on a dataset of JPG images of clouds. I am struggling with data loading, because the train/validation/test sets are not separated and the images are stored in different folders according to their class. The folder structure looks like this:

-dataset_folder
    -Class_1
        img1
        img2
        ...
    -Class_2
        img1
        img2
        ...
    -Class_3
        img1
        img2
        ...
    -Class_4
        img1
        img2
        ...

I saw that the ImageFolder() class could handle this kind of folder structure, but I have no idea how to combine this with separating the dataset into 3 parts.

Can someone please show me a way to do this?

Droidux

3 Answers

You can write a custom Dataset class to load your data and use it in your project:

import os
import glob
from torch.utils.data import Dataset
from PIL import Image

class CustomImageDataset(Dataset):
    def __init__(self, root_dir, transform=None):
        self.root_dir = root_dir
        self.transform = transform
        # Sort the class folders so the folder-to-label mapping is the same on
        # every run (os.listdir returns entries in arbitrary order).
        self.class_folders = sorted(
            f for f in os.listdir(root_dir) if os.path.isdir(os.path.join(root_dir, f))
        )
        self.image_paths = []
        self.labels = []

        # The label of an image is the index of its class folder.
        for label, class_folder in enumerate(self.class_folders):
            img_paths = sorted(glob.glob(os.path.join(root_dir, class_folder, '*.jpg')))
            self.image_paths.extend(img_paths)
            self.labels.extend([label] * len(img_paths))

    def __len__(self):
        return len(self.image_paths)

    def __getitem__(self, idx):
        img_path = self.image_paths[idx]
        label = self.labels[idx]
        # Force three channels so grayscale or RGBA files do not break batching.
        image = Image.open(img_path).convert('RGB')

        if self.transform is not None:
            image = self.transform(image)

        return image, label

See the PyTorch tutorial for more details about custom datasets: https://pytorch.org/tutorials/beginner/basics/data_tutorial.html#creating-a-custom-dataset-for-your-files

After that, you can split your dataset into as many parts as you want. Here is a nice answer on how to split a custom dataset into different sets using SubsetRandomSampler: "How do I split a custom dataset into training and test datasets?"
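The SubsetRandomSampler approach mentioned above could look roughly like the sketch below. The helper names `split_indices` and `make_loaders`, the 70/20/10 proportions, the batch size, and the seed are all assumptions made for illustration; they do not come from the linked answer.

```python
import random


def split_indices(n_items, val_frac=0.2, test_frac=0.1, seed=0):
    """Shuffle the indices 0..n_items-1 and cut them into train/val/test lists.

    Hypothetical helper: the training split receives whatever is left after
    the validation and test splits have been taken.
    """
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)  # seeded, so the split is reproducible
    n_val = int(n_items * val_frac)
    n_test = int(n_items * test_frac)
    val_idx = indices[:n_val]
    test_idx = indices[n_val:n_val + n_test]
    train_idx = indices[n_val + n_test:]
    return train_idx, val_idx, test_idx


def make_loaders(dataset, batch_size=32, seed=0):
    """Build one DataLoader per split over the same underlying dataset."""
    # Imported lazily so split_indices stays usable without PyTorch installed.
    from torch.utils.data import DataLoader, SubsetRandomSampler

    train_idx, val_idx, test_idx = split_indices(len(dataset), seed=seed)
    # Each loader only draws the indices that belong to its own split.
    return tuple(
        DataLoader(dataset, batch_size=batch_size, sampler=SubsetRandomSampler(idx))
        for idx in (train_idx, val_idx, test_idx)
    )
```

With the dataset class from this answer, `train_loader, val_loader, test_loader = make_loaders(CustomImageDataset(root_dir, transform=ToTensor()))` would give three loaders. Because all three share one dataset object, they also share one transform, which matters if you want augmentation on the training split only.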

Ali

You can use ImageFolder to create your dataset, then pass it to torch.utils.data.random_split, which takes the dataset and a list of split lengths and returns one Subset per length.
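As a minimal sketch of that suggestion: the helper names `split_lengths` and `load_splits`, the 70/20/10 proportions, the 224x224 resize, and the seed are assumptions chosen for illustration, not part of the original answer.

```python
def split_lengths(n_items, val_frac=0.2, test_frac=0.1):
    """Return (train, val, test) sizes that sum exactly to n_items.

    Hypothetical helper: random_split requires integer lengths that add up
    to len(dataset), so the training split takes the rounding remainder.
    """
    n_val = int(n_items * val_frac)
    n_test = int(n_items * test_frac)
    return n_items - n_val - n_test, n_val, n_test


def load_splits(root, val_frac=0.2, test_frac=0.1, seed=42):
    """Load an ImageFolder dataset from root and split it into three Subsets."""
    # Imported lazily so split_lengths stays usable without PyTorch installed.
    import torch
    from torch.utils.data import random_split
    from torchvision import datasets, transforms

    transform = transforms.Compose([
        transforms.Resize((224, 224)),  # assumed input size, adjust to your model
        transforms.ToTensor(),
    ])
    # ImageFolder assigns one integer label per sub-folder (Class_1, Class_2, ...).
    dataset = datasets.ImageFolder(root, transform=transform)
    lengths = list(split_lengths(len(dataset), val_frac, test_frac))
    generator = torch.Generator().manual_seed(seed)  # makes the split reproducible
    train_set, val_set, test_set = random_split(dataset, lengths, generator=generator)
    return train_set, val_set, test_set
```

Each returned Subset can be passed straight to a DataLoader, for example `DataLoader(train_set, batch_size=32, shuffle=True)`. Note that random_split does not stratify, so small classes may end up unevenly represented across the three parts.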

TanjiroLL

You are probably referring to this article or something similar, but the easiest way to solve this problem is to put all the images into a single folder, create train, validation, and test folders, and use this:

import os
import random
from shutil import copyfile

# Placeholders: replace these with four distinct directories.
source_folder = 'path/to/folder'
train_folder = 'path/to/folder'
validation_folder = 'path/to/folder'
test_folder = 'path/to/folder'

# Keep regular files only, then shuffle so each split is a random sample.
image_filenames = [f for f in os.listdir(source_folder)
                   if os.path.isfile(os.path.join(source_folder, f))]
random.shuffle(image_filenames)

# 70% train, 10% test, and the remainder (about 20%) validation.
num_train_images = int(len(image_filenames) * 0.7)
num_test_images = int(len(image_filenames) * 0.1)
# Explicit end index: a slice ending at -num_test_images breaks when that count is 0.
validation_end = len(image_filenames) - num_test_images

splits = [
    (train_folder, image_filenames[:num_train_images]),
    (validation_folder, image_filenames[num_train_images:validation_end]),
    (test_folder, image_filenames[validation_end:]),
]

for target_folder, filenames in splits:
    os.makedirs(target_folder, exist_ok=True)  # create the target folder if needed
    for filename in filenames:
        copyfile(os.path.join(source_folder, filename),
                 os.path.join(target_folder, filename))
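
One caveat with copying everything into flat folders is that the class of each image is no longer encoded in its location. A possible variant, sketched below, splits each class folder separately and writes to `<split>/<class>/` so that every split keeps the layout from the question. The function name `split_by_class`, the split folder names, and the default fractions are assumptions made for this illustration.

```python
import os
import random
import shutil


def split_by_class(source_root, target_root, train_frac=0.7, val_frac=0.2, seed=0):
    """Copy source_root/<class>/<file> to target_root/<split>/<class>/<file>.

    Hypothetical helper: each class is shuffled and split on its own, and the
    test split receives whatever remains after train and validation.
    Returns the total number of files copied into each split.
    """
    split_names = ("train", "validation", "test")
    rng = random.Random(seed)  # seeded, so the split is reproducible
    counts = {name: 0 for name in split_names}

    for class_name in sorted(os.listdir(source_root)):
        class_dir = os.path.join(source_root, class_name)
        if not os.path.isdir(class_dir):
            continue  # ignore stray files next to the class folders
        files = sorted(
            f for f in os.listdir(class_dir)
            if os.path.isfile(os.path.join(class_dir, f))
        )
        rng.shuffle(files)
        n_train = int(len(files) * train_frac)
        n_val = int(len(files) * val_frac)
        chunks = (
            files[:n_train],
            files[n_train:n_train + n_val],
            files[n_train + n_val:],
        )
        for split, chunk in zip(split_names, chunks):
            out_dir = os.path.join(target_root, split, class_name)
            os.makedirs(out_dir, exist_ok=True)
            for filename in chunk:
                shutil.copyfile(os.path.join(class_dir, filename),
                                os.path.join(out_dir, filename))
            counts[split] += len(chunk)
    return counts
```

Splitting per class also keeps the class proportions roughly equal across the three parts, at the cost of duplicating the image files on disk.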
Jan
  • This was an option, but I was looking for options from PyTorch that could make the process easier. – Droidux Apr 23 '23 at 11:37