PyTorch Dataloader across epoch

Question

In PyTorch, a dataloader cursor is used to iterate over the data during training. The cursor keeps track of the current position within the dataset, and is used to retrieve the next batch of data for training. When training across multiple epochs, the cursor should reset to the beginning of the dataset after each epoch. This allows the model to see the entire dataset multiple times during training, which can help to improve the model's performance.

How does PyTorch's DataLoader reset the data cursor across epochs? Does it guarantee that each epoch restarts from the beginning of the dataset?

score 1 · Answer 1 · answered Feb 22 '23 at 06:22

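As you have mentioned, iteration over a PyTorch DataLoader restarts from the beginning in each epoch. The DataLoader itself does not carry a cursor from one epoch to the next. Every time you start a for loop over it, Python calls iter() on the DataLoader, which gives you an iterator for a fresh pass over the dataset.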

The DataLoader class uses a Sampler object to determine the order in which samples are retrieved from the dataset.

When shuffle is left at its default of False, the DataLoader uses a "SequentialSampler", which yields the indices 0, 1, 2, ... in order. The sampler is created once, when the DataLoader is constructed, not once per epoch. At the start of each epoch the DataLoader requests a new iterator from that sampler, and the new iterator begins again at index 0. This is what resets the position to the beginning of the dataset for each new epoch.
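A minimal sketch of this behaviour, assuming a toy TensorDataset of ten integers (the dataset, batch size, and variable names are made up for illustration). It checks which sampler the DataLoader picked, runs two epochs, and then abandons a pass early to see where the next one starts:

```python
import torch
from torch.utils.data import DataLoader, SequentialSampler, TensorDataset

# Toy dataset (an assumption for this example): ten samples whose
# values equal their indices, so the order is easy to read off.
dataset = TensorDataset(torch.arange(10))

# shuffle defaults to False, so the DataLoader builds a SequentialSampler here.
loader = DataLoader(dataset, batch_size=4)
uses_sequential = isinstance(loader.sampler, SequentialSampler)

epochs = []
for epoch in range(2):
    # Each for loop calls iter(loader), which starts a fresh pass.
    batches = [batch[0].tolist() for batch in loader]
    epochs.append(batches)

# Abandon a pass after the first batch, then start a new one.
for (x,) in loader:
    break
first_after_break = next(iter(loader))[0].tolist()

print(uses_sequential)
print(epochs)
print(first_after_break)
```

Both epochs should list the same batches in the same order, and the batch fetched after the early break should again be the first batch of the dataset rather than a continuation of the abandoned pass.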

The order in which samples are retrieved from the dataset can be controlled by using a different Sampler object.

The "RandomSampler" can be used to retrieve samples in a random order. Passing shuffle=True to the DataLoader selects it for you.
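The shuffled case can be sketched the same way, again assuming a made-up ten-element TensorDataset and an arbitrary seed. Each epoch should visit every index exactly once, just in a shuffled order:

```python
import torch
from torch.utils.data import DataLoader, RandomSampler, TensorDataset

torch.manual_seed(0)  # arbitrary seed, only to make runs repeatable

# Toy dataset (an assumption for this example): values equal indices.
dataset = TensorDataset(torch.arange(10))

# shuffle=True makes the DataLoader build a RandomSampler.
loader = DataLoader(dataset, batch_size=4, shuffle=True)
uses_random = isinstance(loader.sampler, RandomSampler)

orders = []
for epoch in range(2):
    order = []
    for (x,) in loader:
        order.extend(x.tolist())
    orders.append(order)

print(uses_random)
for order in orders:
    # Sorting shows that every index appears once per epoch.
    print(order, sorted(order) == list(range(10)))
```

With shuffling there is no fixed "first sample", so a reset here means that a new permutation of the whole dataset is drawn for each epoch.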

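However, whichever of these built-in Samplers is used, every new loop over the DataLoader starts a fresh pass, so each epoch covers the whole dataset once. With the SequentialSampler that pass always begins at index 0; with the RandomSampler it begins at the start of a newly drawn random order.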
