6

I have been using the shuffle option of the PyTorch DataLoader many times. But I was wondering when this shuffle happens and whether it is performed dynamically during iteration. Take the following code as an example:

from torch.utils.data import DataLoader

namesDataset = NamesDataset()  # a custom Dataset, defined elsewhere
namesTrainLoader = DataLoader(namesDataset, batch_size=16, shuffle=True)
for batch_data in namesTrainLoader:
    print(batch_data)

When we define "namesTrainLoader", does that mean the shuffling is already finished and the following iteration will be based on a fixed order of data? Will there be any randomness in the for loop after "namesTrainLoader" is defined?

I was trying to replace half of "batch_data" with some special value:

for batch_data in namesTrainLoader:
    batch_data[:8] = special_val
    pre = model(batch_data)

Let us say there is an infinite number of epochs: will "model" eventually see all the data in "namesTrainLoader"? Or is half of the data in "namesTrainLoader" actually lost to "model"?

Jim Wang

2 Answers

7

The shuffling happens when the iterator is created. In the case of the for loop, that happens just before the for loop starts.

You can create the iterator manually with:

# Iterator gets created, the data has been shuffled at this point.
data_iterator = iter(namesTrainLoader)
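
You can see this with a toy dataset (a minimal sketch; TensorDataset and the sizes here are illustrative assumptions, not from the question):

import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.arange(8))
loader = DataLoader(dataset, batch_size=4, shuffle=True)

# Each call to iter() creates a new iterator, and with it a fresh shuffle,
# so the first batches of two separate iterators will usually differ.
print(next(iter(loader)))  # e.g. [tensor([5, 2, 7, 0])]
print(next(iter(loader)))  # e.g. [tensor([3, 6, 1, 4])]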

By default the data loader uses torch.utils.data.RandomSampler if you set shuffle=True (without providing your own sampler). Its implementation is very straightforward, and you can see where the data is shuffled when the iterator is created by looking at the RandomSampler.__iter__ method:

def __iter__(self):
    n = len(self.data_source)
    if self.replacement:
        return iter(torch.randint(high=n, size=(self.num_samples,), dtype=torch.int64).tolist())
    return iter(torch.randperm(n).tolist())

The return statement is the important part, where the shuffling takes place. It simply creates a random permutation of the indices.
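
For example (a tiny sketch with an assumed toy dataset), you can run the sampler directly and see that it just yields the indices 0..n-1 in a random order:

import torch
from torch.utils.data import RandomSampler, TensorDataset

dataset = TensorDataset(torch.arange(8))
print(list(RandomSampler(dataset)))  # e.g. [2, 0, 6, 5, 3, 7, 1, 4]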

That means you will see your entire dataset every time you fully consume the iterator, just in a different order each time. Therefore no data is lost (excluding cases with drop_last=True), and your model will see all the data at every epoch.
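
A quick sanity check of that claim (a sketch, again with an assumed toy dataset): collect everything one epoch yields and compare it against the dataset.

import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.arange(100))
loader = DataLoader(dataset, batch_size=16, shuffle=True)

for epoch in range(3):
    # Concatenate all batches of this epoch and check full coverage.
    seen = torch.cat([batch[0] for batch in loader])
    assert sorted(seen.tolist()) == list(range(100))  # nothing is lost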

Michael Jungo
  • Thanks for the response. So my model will see all data at every epoch, even after half of batch_data is destroyed by "special_val"? – Jim Wang May 10 '20 at 22:05
  • If you overwrite it, you will not actually use that data in this particular iteration. The data you receive will cover the whole dataset, but if you decide to overwrite or ignore it, the model won't be seeing it. If you're asking whether that affects future iterations, the answer is usually no, except in rare cases where you stored the tensors in your dataset itself, in which case the in-place operations will affect them. That's usually not the case, since you either load the data on demand or only create the tensors during batching, so even in-place operations have no effect (see the sketch after these comments). – Michael Jungo May 10 '20 at 22:15
  • So, is it correct to say that the DataLoader actually shuffles the indices to randomly select data in every epoch, rather than shuffling the actual data itself? – Elm Liu Dec 03 '22 at 03:53
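
A small sketch of the point from the comment thread above (the toy dataset is an assumption): the default collate function stacks samples into a new tensor, so overwriting half of each batch in place, as in the question, does not modify the stored data.

import torch
from torch.utils.data import DataLoader, TensorDataset

data = torch.arange(8).float()
loader = DataLoader(TensorDataset(data), batch_size=4, shuffle=True)

for (batch,) in loader:
    batch[:2] = -1.0  # in-place overwrite of half the batch

print(data)  # unchanged: each batch was a freshly stacked copy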
4

You can check PyTorch's implementation of torch.utils.data.DataLoader here.

If you specify shuffle=True, torch.utils.data.RandomSampler will be used (SequentialSampler otherwise).
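
You can confirm which sampler was chosen by inspecting the loader's sampler attribute (a small sketch; the toy dataset is an assumption):

import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.arange(10))
print(type(DataLoader(dataset, shuffle=True).sampler))   # RandomSampler
print(type(DataLoader(dataset, shuffle=False).sampler))  # SequentialSampler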

When an instance of DataLoader is created, nothing is shuffled yet; it just instantiates the necessary private members of the object and does other setup-like things.

When the special __iter__ method is called during iteration, as in your case, a special object named _SingleProcessDataLoaderIter(self) is returned, which is a generator of data (possibly batched, shuffled etc., assuming you don't use multiprocessing).

There is a bit of a rabbit hole to follow through all the private and helper-related methods, but what it basically does is use the underlying sampler to get indices, which are then used to fetch samples from the torch.utils.data.Dataset.

The sampler is run until exhaustion, and then the process repeats (usually this corresponds to a single epoch).
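
Conceptually (a simplified sketch that ignores workers, pinned memory and the real collation logic; the toy dataset is an assumption), one epoch boils down to something like this:

import torch
from torch.utils.data import RandomSampler, TensorDataset

dataset = TensorDataset(torch.arange(20))
sampler = RandomSampler(dataset)
batch_size = 8

batch = []
for idx in sampler:  # a fresh permutation every time the sampler is iterated
    batch.append(dataset[idx][0])
    if len(batch) == batch_size:
        print(torch.stack(batch))
        batch = []
if batch:  # leftover samples (what drop_last=True would discard)
    print(torch.stack(batch))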

Will there be any randomness in the for loop after namesTrainLoader was defined?

At the start of each cycle/epoch, RandomSampler shuffles the indices, so yes, it will be randomized before every epoch (whenever __iter__ is called and a new _SingleProcessDataLoaderIter(self) is returned), and this can be done indefinitely.

[...] will "model" eventually see all the data in "namesTrainLoader"?

Yes, it most probably will see all data points eventually. Each epoch covers the whole dataset in a fresh random order, so with batch_size=16 any given sample lands in the overwritten first half of its batch with probability about 1/2 per epoch; over an infinite number of epochs it will therefore be seen with probability 1.

Szymon Maszke