
I'm trying to run https://github.com/menardai/FashionGenAttnGAN in Google Colab on a GPU with 30 GB of disk space. The code and its dataset files take about 15 GB, and after extracting them only about 14 GB of disk remains. When I run Pretrain.py, I can see the captions loading, but then I suddenly get an "AssertionError". Since I haven't found a proper explanation for the cause of this error, I suspect it is due to the lack of space in the Colab environment. The solution that came to my mind is to write some code that tells the model to load only 30% of the train and test datasets, but I don't know how to do this. Can anyone help me, please?

1 Answer


data is your full dataset; you can divide it however you want by editing valid_size.

import numpy as np
import torch
from torch.utils.data.sampler import SubsetRandomSampler

valid_size = 0.3                      # fraction of the data to hold out for validation
num_train = len(data)                 # data is your full dataset
indices = list(range(num_train))
np.random.shuffle(indices)            # shuffle so the split is random
split = int(np.floor(valid_size * num_train))
train_idx, valid_idx = indices[split:], indices[:split]

# define samplers for obtaining training and validation batches
train_sampler = SubsetRandomSampler(train_idx)
valid_sampler = SubsetRandomSampler(valid_idx)

# prepare data loaders (combine dataset and sampler)
train_loader = torch.utils.data.DataLoader(data, batch_size=4,
    sampler=train_sampler, num_workers=2)
valid_loader = torch.utils.data.DataLoader(data, batch_size=4, 
    sampler=valid_sampler, num_workers=2)

If a memory issue occurs, just reduce batch_size.
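If you literally want to load only 30% of the samples rather than split them into train/validation, a hedged alternative sketch is to wrap the dataset in torch.utils.data.Subset (the names fraction, small_data, small_loader and the 0.3 value are illustrative, not from the original repo):

import numpy as np
from torch.utils.data import Subset, DataLoader

fraction = 0.3                                        # keep only 30% of the samples (assumption)
keep = np.random.permutation(len(data))[:int(fraction * len(data))]
small_data = Subset(data, keep.tolist())              # view of the dataset restricted to those indices
small_loader = DataLoader(small_data, batch_size=4, shuffle=True, num_workers=2)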

Sudhanshu