
Is the iteration order for a PyTorch DataLoader guaranteed to be the same (under mild conditions)?

For instance:

from torch.utils.data import DataLoader

dataloader = DataLoader(my_dataset, batch_size=4,
                        shuffle=True, num_workers=4)
print("run 1")
for batch in dataloader:
  print(batch["index"])

print("run 2")
for batch in dataloader:
  print(batch["index"])

So far, I've tried testing it, and the order does not appear to be fixed: I don't get the same order for both runs. Is there a way to make the order the same? Thanks

Edit: I have also tried doing

from torch.utils import data

unlabeled_sampler = data.sampler.SubsetRandomSampler(unlabeled_indices)
unlabeled_dataloader = data.DataLoader(train_dataset, sampler=unlabeled_sampler,
                                       batch_size=args.batch_size, drop_last=False)

and then iterating through the dataloader twice, but the same non-determinism results.

information_interchange
  • it is stable provided `shuffle=False`; in your case you're explicitly requesting the data to be returned in a random order by setting `shuffle=True` – jodag Dec 13 '19 at 09:26
  • OK, good point. But it is the "same" dataloader, no? – information_interchange Dec 13 '19 at 17:32
  • same dataset, not the same loader. The loader is "just" an interface to the dataset which defines, among other things, a sampler. The sampler samples your dataset in the way and order it was defined to. If you change `shuffle`, then you're changing the sampler that the dataloader is using, which can make it go from stable to unstable. You can also explicitly specify the sampler when defining the dataloader. – jodag Dec 13 '19 at 17:51
  • Thank you for clarifying! So actually I have: `unlabeled_sampler = data.sampler.SubsetRandomSampler(unlabeled_indices)` and then `unlabeled_dataloader = data.DataLoader(train_dataset, sampler=unlabeled_sampler, batch_size=args.batch_size, drop_last=False)` and the iteration order is still unstable. Any thoughts? – information_interchange Dec 13 '19 at 18:57
  • I think I understand your issue better now. I posted an answer that I believe answers your question. – jodag Dec 13 '19 at 19:49

2 Answers


The short answer is no: when `shuffle=True` the iteration order of a `DataLoader` isn't stable between iterations. Each time you iterate over your loader, the internal `RandomSampler` creates a new random order.
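
For example, here is a minimal self-contained sketch (using a toy `TensorDataset` in place of `my_dataset`) showing that `shuffle=True` just installs a `RandomSampler` under the hood, and that every new iterator draws a fresh permutation:

import torch
from torch.utils.data import DataLoader, TensorDataset

toy_dataset = TensorDataset(torch.arange(10))   # stand-in for my_dataset
loader = DataLoader(toy_dataset, batch_size=4, shuffle=True)

print(type(loader.sampler).__name__)            # -> RandomSampler
# Each `for batch in loader:` builds a fresh iterator, so these two passes
# will (almost always) print the batches in different orders:
print([batch[0].tolist() for batch in loader])
print([batch[0].tolist() for batch in loader])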

One way to get a stable shuffled `DataLoader` is to create a `Subset` dataset using a shuffled set of indices.

shuffled_dataset = torch.utils.data.Subset(my_dataset, torch.randperm(len(my_dataset)).tolist())
dataloader = DataLoader(shuffled_dataset, batch_size=4, num_workers=4, shuffle=False)
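
As a quick sanity check, a self-contained sketch of the same idea (the `TensorDataset` is just a toy stand-in for `my_dataset`): iterating the Subset-backed loader twice yields the same order, because the permutation is fixed once, when the `Subset` is built.

import torch
from torch.utils.data import DataLoader, Subset, TensorDataset

my_dataset = TensorDataset(torch.arange(100))   # toy stand-in dataset
shuffled_dataset = Subset(my_dataset, torch.randperm(len(my_dataset)).tolist())
dataloader = DataLoader(shuffled_dataset, batch_size=4, shuffle=False)

# The permutation lives inside the Subset, so every pass sees the same order.
assert [b[0].tolist() for b in dataloader] == [b[0].tolist() for b in dataloader]
# To reshuffle for a new epoch, rebuild the Subset with a fresh randperm.
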
jodag
  • Thank you, let me test one more idea I have and then I will try your answer and accept. What is strange to me is that if I set the seeds appropriately, then the internal `RandomSampler` should give the same random indices every time, no? – information_interchange Dec 13 '19 at 22:19
  • @information_interchange I believe the randomization in `RandomSampler` occurs when a dataloader iterator is created (e.g. when you do `for label, data in dataloader:`). You would need to seed torch's random number generator (e.g. `torch.manual_seed(1234)`) with the same seed value **immediately before iterating through your dataloader** each time to ensure reproducibility. This isn't ideal, as any other random behavior in your system would end up being repeated as well, which may not be desired. – jodag Dec 14 '19 at 00:52
  • Hey actually, I just tried this method, and sadly it doesn't work: `ValueError: sampler should be an instance of torch.utils.data.Sampler, but got sampler=[739, 841, 1892,..]` – information_interchange Dec 17 '19 at 06:18
  • Oh that's actually really interesting, you're right. That's quite surprising, since this was the recommendation of one of the PyTorch developers. Anyway, I reverted to my first solution, which will work equally well, and I've tested to make sure. – jodag Dec 17 '19 at 09:12

I actually went with jodag's in-the-comments answer:

torch.manual_seed("0")

for i,elt in enumerate(unlabeled_dataloader):
    order.append(elt[2].item())
    print(elt)

    if i > 10:
        break

torch.manual_seed("0")

print("new dataloader")
for i,elt in enumerate( unlabeled_dataloader):
    print(elt)
    if i > 10:
        break
exit(1)                       

and the output:

[tensor([[-0.3583, -0.6944]]), tensor([3]), tensor([1610])]
[tensor([[-0.6623, -0.3790]]), tensor([3]), tensor([1958])]
[tensor([[-0.5046, -0.6399]]), tensor([3]), tensor([1814])]
[tensor([[-0.5349,  0.2365]]), tensor([2]), tensor([1086])]
[tensor([[-0.1310,  0.1158]]), tensor([0]), tensor([321])]
[tensor([[-0.2085,  0.0727]]), tensor([0]), tensor([422])]
[tensor([[ 0.1263, -0.1597]]), tensor([0]), tensor([142])]
[tensor([[-0.1387,  0.3769]]), tensor([1]), tensor([894])]
[tensor([[-0.0500,  0.8009]]), tensor([3]), tensor([1924])]
[tensor([[-0.6907,  0.6448]]), tensor([4]), tensor([2016])]
[tensor([[-0.2817,  0.5136]]), tensor([2]), tensor([1267])]
[tensor([[-0.4257,  0.8338]]), tensor([4]), tensor([2411])]
new dataloader
[tensor([[-0.3583, -0.6944]]), tensor([3]), tensor([1610])]
[tensor([[-0.6623, -0.3790]]), tensor([3]), tensor([1958])]
[tensor([[-0.5046, -0.6399]]), tensor([3]), tensor([1814])]
[tensor([[-0.5349,  0.2365]]), tensor([2]), tensor([1086])]
[tensor([[-0.1310,  0.1158]]), tensor([0]), tensor([321])]
[tensor([[-0.2085,  0.0727]]), tensor([0]), tensor([422])]
[tensor([[ 0.1263, -0.1597]]), tensor([0]), tensor([142])]
[tensor([[-0.1387,  0.3769]]), tensor([1]), tensor([894])]
[tensor([[-0.0500,  0.8009]]), tensor([3]), tensor([1924])]
[tensor([[-0.6907,  0.6448]]), tensor([4]), tensor([2016])]
[tensor([[-0.2817,  0.5136]]), tensor([2]), tensor([1267])]
[tensor([[-0.4257,  0.8338]]), tensor([4]), tensor([2411])]

which is as desired. However, I think jodag's main answer is still better; this is just a quick hack which works for now ;)

information_interchange