
I have a large dataset (approx. 500GB, ~180k data points plus labels) in a PyTorch dataloader. Until now, I used torch.utils.data.random_split to split the dataset randomly into training and validation sets. However, this led to serious overfitting. I now want to use a deterministic split instead, i.e. based on the paths stored in the dataloader I could work out a non-random split. However, I have no idea how to do this... The question is: how can I get the IDs of about 10% of the data points based on some query that looks at the information about the files stored in the data loader (e.g. the paths)?

Michael
  • Can you just create two dataloaders? One train and one val. – conv3d Jan 02 '20 at 18:38
  • Good point. I guess that's the natural solution, but it's exactly what I'd like to avoid; that's why I'm asking. :) – Michael Jan 02 '20 at 19:41
  • I can't think of a solution without re-building the `Dataloader` at least once. You can set `shuffle=False` in your `Dataloader`, pass the paths to it in a specific order so that every `n` data points go to `val` and then to `train`, and then set `batch_size` to `n`. – conv3d Jan 02 '20 at 20:46

1 Answer


Have you used a custom dataset along with the dataloader? If the underlying dataset has a variable that stores the filenames of the individual files, you can access it as `dataloader.dataset.filename_variable`.
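
For example, once the paths are accessible, you can derive a deterministic split from them and build the two subsets with `torch.utils.data.Subset`. A minimal sketch, assuming `dataloader` is your existing loader and that the paths live in a hypothetical attribute `file_paths` (substitute whatever your dataset actually calls it):

```python
from torch.utils.data import DataLoader, Subset

# `file_paths` is a hypothetical attribute name -- use the one your
# dataset actually defines to store the per-sample file paths.
paths = dataloader.dataset.file_paths

# Deterministic split: sort the indices by path and take every 10th
# item (~10%) for validation, instead of sampling randomly.
order = sorted(range(len(paths)), key=lambda i: paths[i])
val_indices = [i for pos, i in enumerate(order) if pos % 10 == 0]
train_indices = [i for pos, i in enumerate(order) if pos % 10 != 0]

train_set = Subset(dataloader.dataset, train_indices)
val_set = Subset(dataloader.dataset, val_indices)

train_loader = DataLoader(train_set, batch_size=32, shuffle=True)
val_loader = DataLoader(val_set, batch_size=32, shuffle=False)
```

Because the split depends only on the sorted paths, it comes out the same on every run.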

If that's not available, you can write a custom dataset yourself that wraps the original dataset and delegates to it.
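
A rough sketch of such a wrapper (the `get_path` hook and the predicate are assumptions; adapt them to however your dataset stores its paths):

```python
from torch.utils.data import Dataset

class FilteredDataset(Dataset):
    """Expose only the items of `base_dataset` whose path satisfies
    `predicate`. `get_path` maps an index of the original dataset to
    its file path (a hypothetical hook -- adapt it to your dataset)."""

    def __init__(self, base_dataset, get_path, predicate):
        self.base = base_dataset
        # Resolve the kept indices once, so the split is deterministic.
        self.indices = [i for i in range(len(base_dataset))
                        if predicate(get_path(i))]

    def __len__(self):
        return len(self.indices)

    def __getitem__(self, idx):
        # Delegate to the original dataset with the remapped index.
        return self.base[self.indices[idx]]

# Example: treat files under a particular subdirectory as validation data.
# val_set = FilteredDataset(base, lambda i: base.file_paths[i],
#                           lambda p: "/val/" in p)
# train_set = FilteredDataset(base, lambda i: base.file_paths[i],
#                             lambda p: "/val/" not in p)
```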

Roshan Santhosh