How to load a dataset starting from a list of images in PyTorch

Question

I have a service that receives images in a binary format from another service (let's call it service B):

import io

from PIL import Image

img_list = []
img_bin = get_image_from_service_B()
image = Image.open(io.BytesIO(img_bin))  # Convert bytes to an image using PIL

When an image is successfully converted by PIL, it is also appended to a list of images:

img_list.append(image)

When I have enough images, I want to load my list of images with PyTorch as if it were a dataset:

if len(img_list) == 500:
    # Load the dataset and apply a transform to the data
    ...

In a previous version of the software, the requirement was simply to retrieve the images from a folder, so it was quite simple to load all the images:

from torch.utils.data import DataLoader
from torchvision import datasets

my_dataset = datasets.ImageFolder("path/to/images/folder/", transform=transform)
dataset_iterator = DataLoader(my_dataset, batch_size=1)

Now my issue is how to perform the transform and load the dataset from a list.

Try using pytorch/serve; it has a request batching option, which I think should do it. Otherwise you will have to use an async queue. – Krueger, Oct 28 '20 at 14:14

score 9 · Accepted Answer · edited Jul 21 '22 at 12:31

You can simply write a custom dataset:

import torch


class MyDataset(torch.utils.data.Dataset):
    def __init__(self, img_list, augmentations):
        super().__init__()
        self.img_list = img_list  # list of PIL images
        self.augmentations = augmentations  # callable applied to each image

    def __len__(self):
        return len(self.img_list)

    def __getitem__(self, idx):
        img = self.img_list[idx]
        return self.augmentations(img)

You can now plug this custom dataset into DataLoader and you are done.

edited Jul 21 '22 at 12:31 by Gulzar
answered Oct 28 '20 at 14:15 by Shai

Thank you very much. I would like to add that in my case it was also necessary to add targets, `self.targets = torch.LongTensor(my_targets)`, where `my_targets` was basically another list. Of course, `__getitem__` also returns the target related to the image – Tajinder Singh Nov 02 '20 at 14:22
@TajinderSingh obviously it's always nice to have targets.. – Shai Nov 02 '20 at 14:27
I am guessing this requires all images to first reside in memory? If this is impossible/undesired, would it be correct to instead use `img_path_list`, and have `__getitem__` do `cv2.imread`? Does that kind of thing have any drawback performance-wise? – Gulzar Jul 25 '22 at 09:34
Also, I must ask why are augmentations part of a dataset, rather than part of a dataloader? It would make sense a dataset should represent the single source of truth, which augmentations inherently break. – Gulzar Jul 25 '22 at 09:43
@Gulzar `self.img_list` does not have to contain the actual images; it can store paths (the usual case). You can think of `self.augmentations` as a function (or functions) that converts an item from `self.img_list` (a path or an actual image) into a tensor in the right shape and format for the training loop – Shai Jul 25 '22 at 09:46
Do you mean the 1st "augmentation" would be `cv2.imread`? Otherwise I am not following – Gulzar Jul 25 '22 at 09:48
@Gulzar traditionally, `augmentations` are part of the `Dataset` -- this module is in charge of producing _single_ data points for training, while the `DataLoader` module is in charge of collecting these points into batches. Therefore, augmentations, which are applied to each image independently, are part of `Dataset`, while considerations such as `batch_size` and `shuffle` are part of `DataLoader`. – Shai Jul 25 '22 at 09:49
@Gulzar you can think of the first "augmentation" as `pil_read` or `cv2.imread` (don't use `cv2` - it's annoying). Alternatively, you can see that `torchvision.datasets.ImageFolder` explicitly calls `self.loader` in `__getitem__` before the augmentations. – Shai Jul 25 '22 at 09:52
So to recap, in the example above, I am expected to manually pass `pil_read` as the first augmentation, and multiprocessing will be handled for me correctly in the dataloader. Correct? – Gulzar Jul 25 '22 at 09:56
@Gulzar that's about it. – Shai Jul 25 '22 at 10:03
Btw, why not use cv2? This is my code base, and I am quite used to it. Any functional reasoning? – Gulzar Jul 26 '22 at 19:52
@Gulzar BGR vs RGB for starters. I don't like opencv – Shai Jul 26 '22 at 21:15
