
I'm currently trying to make a custom PyTorch DataLoader.

I'm aware that setting drop_last=True when first declaring the DataLoader tells it to drop the last batch if it is incomplete. However, I was wondering if that could be done in reverse, where the DataLoader computes the number of full batches and counts from the back, so that the leftover samples fall at the front instead.
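
For illustration, here is the current behavior I mean, on a toy dataset of 10 samples with a batch size of 3 (both made up for this example):

import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.arange(10))  # 10 "time steps", oldest first

loader = DataLoader(dataset, batch_size=3, drop_last=True)
for (batch,) in loader:
    print(batch)
# tensor([0, 1, 2])
# tensor([3, 4, 5])
# tensor([6, 7, 8])  <- sample 9, the most recent one, is dropped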

The reason I'm asking is that the data I'm currently using is time series data. I want to use the most recent samples, so it would be ideal if the "leftover" samples were dropped from the oldest portion of the data.

I've thought of workarounds such as reversing the data first, creating the DataLoader, and then reversing it back to the way it was, or reversing the data and feeding the indices in reverse order in __getitem__. But this seems troublesome and error-prone, so I was wondering if PyTorch offers this behavior out of the box (a rough sketch of the second idea is below).
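
Something like this wrapper is what I have in mind for the second option (ReversedDataset is just a name I made up for the sketch):

from torch.utils.data import Dataset

class ReversedDataset(Dataset):
    """Views an existing dataset back-to-front, so that drop_last=True
    would discard the oldest samples instead of the newest ones."""

    def __init__(self, dataset):
        self.dataset = dataset

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, idx):
        # index from the end: idx 0 is the most recent sample
        return self.dataset[len(self.dataset) - 1 - idx]

# but the batches then come out newest-first, which I would still
# have to undo afterwards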

Thanks in advance.

Sean

1 Answer


Computing the number of samples which will be removed is relatively straightforward. Once you have that number, you can use torch.utils.data.Subset to truncate your dataset from the start. For example:

import torch

batch_size = ...  # your batch size
dataset = ...     # your dataset

# number of leftover samples that don't fit into a full batch
dropped_samples = len(dataset) % batch_size

# keep everything except the first `dropped_samples` entries
subset_dataset = torch.utils.data.Subset(dataset, range(dropped_samples, len(dataset)))
loader = torch.utils.data.DataLoader(subset_dataset, ...)

In this case setting drop_last=True would have no effect since len(subset_dataset) is divisible by batch_size.
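
As a quick sanity check, here is what this looks like on a toy dataset of 10 samples with batch_size=3 (numbers made up for illustration):

import torch
from torch.utils.data import DataLoader, TensorDataset, Subset

batch_size = 3
dataset = TensorDataset(torch.arange(10))  # 10 "time steps", oldest first

dropped_samples = len(dataset) % batch_size  # 1 leftover sample
subset_dataset = Subset(dataset, range(dropped_samples, len(dataset)))

loader = DataLoader(subset_dataset, batch_size=batch_size)
for (batch,) in loader:
    print(batch)
# tensor([1, 2, 3])
# tensor([4, 5, 6])
# tensor([7, 8, 9])  <- only the oldest sample (0) was cut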

jodag