I am trying to load image data for model training from self-hosted S3 storage (MinIO). PyTorch provides new datapipes with this functionality in the torchdata library.
So within my function to create the datapipe, I have these lines:
dp_s3 = IterableWrapper(list(sample_dict.keys()))
dp_s3 = dp_s3.load_files_by_s3()
dp_s3 = dp_s3.map(open_image)
dp_s3 = dp_s3.map(transform)
The problem with this approach is that the S3 file loader datapipe returns a tuple of a string containing the file path on the S3 storage (which effectively acts as the label) and an io.BytesIO stream containing the image data. However, I have all the labels and the files to load in separate text files, which are loaded into sample_dict (a dictionary mapping file paths to classification labels) in a previous step.
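To make the shapes concrete, my open_image mapping function is essentially something like the following sketch (the PIL-based decoding is just illustrative):

from PIL import Image

def open_image(item):
    # The S3 file loader yields (s3_path, stream): the S3 path as a string
    # and a file-like object (io.BytesIO) with the raw image bytes.
    s3_path, stream = item
    image = Image.open(stream).convert("RGB")
    # Only s3_path is available here as a "label"; the real classification
    # label lives in sample_dict, outside this function.
    return s3_path, image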
The question now is: how can I get the labels from sample_dict into my mapping functions?
There seem to be two main obstacles to achieving this:
- The dataloader is multi-threaded, and I get a pickle error if I add sample_dict to it. I also cannot make the dictionary globally accessible to the other worker threads, which are handled by PyTorch.
- load_files_by_s3() is the functional name for S3FileLoader, which can only deal with S3-style file paths as input. My initial thought was that I need to use a map-style datapipe for this instead of an iterable-style one, but unfortunately there are no map-style S3 datapipes available.
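For concreteness, this is roughly the mapping step I would like to end up with, assuming sample_dict could somehow be made available inside the workers (the attach_label helper and the use of functools.partial are only illustrative, not something that currently works for me):

from functools import partial

def attach_label(item, labels):
    # labels is meant to be sample_dict: a dict mapping S3 file paths to class labels
    s3_path, image = item
    return image, labels[s3_path]

dp_s3 = dp_s3.map(partial(attach_label, labels=sample_dict))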