I am trying to load image data for model training from self-hosted S3 storage (MinIO). PyTorch provides new datapipes with this functionality in the torchdata library.
So within my function to create the datapipe, I have these lines:
dp_s3 = IterableWrapper(list(sample_dict.keys()))
dp_s3 = dp_s3.load_files_by_s3()
dp_s3 = dp_s3.map(open_image)
dp_s3 = dp_s3.map(transform)
The problem with this approach is that the S3 file loader datapipe returns a tuple of a string containing the file path on the S3 storage (which effectively acts as the label) and an io.BytesIO stream containing the image data. However, I have all the labels and the files to load in separate text files, which are loaded into sample_dict (a dictionary mapping file paths to classification labels) in a previous step.
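To make the shapes concrete, my open_image mapping function is essentially something like the following sketch (the PIL-based decoding is just illustrative):

from PIL import Image

def open_image(item):
    # The S3 file loader yields (s3_path, stream): the S3 path as a string
    # and a file-like object (io.BytesIO) with the raw image bytes.
    s3_path, stream = item
    image = Image.open(stream).convert("RGB")
    # Only s3_path is available here as a "label"; the real classification
    # label lives in sample_dict, outside this function.
    return s3_path, image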
The question now is: how can I get the labels from sample_dict into my mapping functions?
There seem to be two main obstacles to achieving this:
- The dataloader is multi-threaded, and I get a pickle error if I add sample_dict to it. I also cannot make the dictionary globally accessible to the other worker threads, which are handled by PyTorch.
- load_files_by_s3() is the functional name for S3FileLoader, which can only deal with S3-style file paths as input. My initial thought was that I need to use a map-style datapipe for this instead of an iterable-style one, but unfortunately there are no map-style S3 datapipes available.
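For concreteness, this is roughly the mapping step I would like to end up with, assuming sample_dict could somehow be made available inside the workers (the attach_label helper and the use of functools.partial are only illustrative, not something that currently works for me):

from functools import partial

def attach_label(item, labels):
    # labels is meant to be sample_dict: a dict mapping S3 file paths to class labels
    s3_path, image = item
    return image, labels[s3_path]

dp_s3 = dp_s3.map(partial(attach_label, labels=sample_dict))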