
I have two dataloaders and I would like to merge them without redefining the underlying datasets, which in my case are train_dataset and val_dataset.

train_loader = DataLoader(train_dataset, batch_size=512, drop_last=True, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=512, drop_last=False)

Desired result:

train_loader = train_loader + val_loader 
– Johnpiton
  • how is your question different to: https://stackoverflow.com/questions/60840500/pytorch-concatenating-datasets-before-using-dataloader? – Charlie Parker Sep 27 '22 at 01:29
  • easiest solution to what I want is to use this: https://discuss.pytorch.org/t/does-concatenate-datasets-preserve-class-labels-and-indices/62611/12?u=brando_miranda by using learn2learn's union of data sets. – Charlie Parker Sep 27 '22 at 02:00
  • useful: https://stackoverflow.com/questions/69792591/combing-two-torchvision-dataset-objects-into-a-single-dataloader-in-pytorch?noredirect=1#comment130421381_69792591 – Charlie Parker Sep 27 '22 at 02:08

3 Answers


Data loaders are iterables; you can implement a function that returns an iterator which yields the dataloaders' contents, one dataloader after the other.

Given a number of iterators itrs, it would go through each iterator in turn, yielding one batch at a time. A possible implementation is as simple as:

def itr_merge(*itrs):
    # exhaust each iterator in turn, yielding one batch at a time
    for itr in itrs:
        for v in itr:
            yield v

Here is a usage example:

>>> dl1 = DataLoader(TensorDataset(torch.zeros(5, 1)), batch_size=2, drop_last=True)
>>> dl2 = DataLoader(TensorDataset(torch.ones(10, 1)), batch_size=2)

>>> for x in itr_merge(dl1, dl2):
>>>   print(x)
[tensor([[0.], [0.]])]
[tensor([[0.], [0.]])]
[tensor([[1.], [1.]])]
[tensor([[1.], [1.]])]
[tensor([[1.], [1.]])]
[tensor([[1.], [1.]])]
[tensor([[1.], [1.]])]
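
The same merging can also be done with itertools.chain from the standard library, which yields from each iterable in sequence; a minimal sketch using the dl1 and dl2 above:

from itertools import chain

# chain yields every batch of dl1 first, then every batch of dl2,
# preserving each loader's own batch_size and drop_last settings
for x in chain(dl1, dl2):
    print(x)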
– Ivan

There is a ConcatDataset available, documented at https://pytorch.org/docs/stable/_modules/torch/utils/data/dataset.html#ConcatDataset. You could concatenate the datasets before passing them to the DataLoader:

import torch
from torch.utils.data import TensorDataset, ConcatDataset, DataLoader

# two datasets with the same sample structure: (features, target)
dsa = TensorDataset(torch.rand(100, 3), torch.rand(100, 1))
dsb = TensorDataset(torch.rand(150, 3), torch.rand(150, 1))

# concatenate them end to end: 250 samples in total
dsab_cat = ConcatDataset([dsa, dsb])
dsab_cat_loader = DataLoader(dsab_cat)

refs: https://www.oreilly.com/library/view/deep-learning-with/9781789534092/5f2cf6d8-4cdf-4e83-8c5b-58fbf722f6b6.xhtml
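
Applied to the question's setup, a minimal sketch (assuming train_dataset and val_dataset return samples with the same structure, which the default collate function needs for batching) could look like:

from torch.utils.data import ConcatDataset, DataLoader

# assumes train_dataset and val_dataset are the datasets from the question;
# the batch settings are carried over from the original train_loader
merged_dataset = ConcatDataset([train_dataset, val_dataset])
merged_loader = DataLoader(merged_dataset, batch_size=512, shuffle=True, drop_last=True)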

– bluesmonk
  • I'd use the index produced by `enumerate` when iterating over the dataset and use that index as a class. But it's hard to tell what your use case is. I'd suggest posting a new question with a self-contained example. – bluesmonk Sep 27 '22 at 14:52
  • that is what I was going to do. – Charlie Parker Sep 27 '22 at 17:55

This returns a list of batches that you can iterate over for training, the same way you would iterate over a DataLoader:

# eagerly pull every batch out of both loaders into one flat list
trainval = [d for dl in [train_loader, val_loader] for d in dl]
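
Note that this materializes every batch in memory up front, which can get expensive for large datasets; the generator-based approach above avoids that. One thing the list form does allow is reshuffling the merged batches between epochs, for example (a sketch using the trainval list above):

import random

# reorder the merged batches so validation batches are interleaved
# with the training batches instead of always coming last
random.shuffle(trainval)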
– fiesaratnu