
I have a large dataset (approx. 500GB, ~180k data points plus labels) in a PyTorch dataloader. Until now, I used torch.utils.data.random_split to split the dataset randomly into training and validation sets. However, this led to serious overfitting. I now want to use a deterministic split instead, i.e. based on the paths stored in the dataloader I could work out a non-random split. However, I have no idea how to do this... The question is: how can I get the IDs of about 10% of the data points based on some query that looks at the information about the files stored in the data loader (e.g. the paths)?

Michael
  • Can you just create two dataloaders? One train and one val. – conv3d Jan 02 '20 at 18:38
  • Good point. I guess that's the natural solution, but it's exactly what I'd like to avoid; that's why I'm asking. :) – Michael Jan 02 '20 at 19:41
  • I can't think of a solution without re-building the `Dataloader` at least once. You can set `shuffle=False` in your `Dataloader`, pass the paths to it in a specific order so that every `n` data points go to `val` and then to `train`, and then set `batch_size` to `n`. – conv3d Jan 02 '20 at 20:46

1 Answer


Have you used a custom dataset along with the dataloader? If the underlying dataset has a variable that stores the filenames of the individual files, you can access it as `dataloader.dataset.filename_variable`.
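
For example, once the paths are accessible, you can derive a deterministic split from them and build the two subsets with `torch.utils.data.Subset`. A minimal sketch, assuming `dataloader` is your existing loader and that the paths live in a hypothetical attribute `file_paths` (substitute whatever your dataset actually calls it):

```python
from torch.utils.data import DataLoader, Subset

# `file_paths` is a hypothetical attribute name -- use the one your
# dataset actually defines to store the per-sample file paths.
paths = dataloader.dataset.file_paths

# Deterministic split: sort the indices by path and take every 10th
# item (~10%) for validation, instead of sampling randomly.
order = sorted(range(len(paths)), key=lambda i: paths[i])
val_indices = [i for pos, i in enumerate(order) if pos % 10 == 0]
train_indices = [i for pos, i in enumerate(order) if pos % 10 != 0]

train_set = Subset(dataloader.dataset, train_indices)
val_set = Subset(dataloader.dataset, val_indices)

train_loader = DataLoader(train_set, batch_size=32, shuffle=True)
val_loader = DataLoader(val_set, batch_size=32, shuffle=False)
```

Because the split depends only on the sorted paths, it comes out the same on every run.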

If that's not available, you can write a custom dataset yourself that wraps the original dataset and delegates to it.
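
A rough sketch of such a wrapper (the `get_path` hook and the predicate are assumptions; adapt them to however your dataset stores its paths):

```python
from torch.utils.data import Dataset

class FilteredDataset(Dataset):
    """Expose only the items of `base_dataset` whose path satisfies
    `predicate`. `get_path` maps an index of the original dataset to
    its file path (a hypothetical hook -- adapt it to your dataset)."""

    def __init__(self, base_dataset, get_path, predicate):
        self.base = base_dataset
        # Resolve the kept indices once, so the split is deterministic.
        self.indices = [i for i in range(len(base_dataset))
                        if predicate(get_path(i))]

    def __len__(self):
        return len(self.indices)

    def __getitem__(self, idx):
        # Delegate to the original dataset with the remapped index.
        return self.base[self.indices[idx]]

# Example: treat files under a particular subdirectory as validation data.
# val_set = FilteredDataset(base, lambda i: base.file_paths[i],
#                           lambda p: "/val/" in p)
# train_set = FilteredDataset(base, lambda i: base.file_paths[i],
#                             lambda p: "/val/" not in p)
```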

Roshan Santhosh