
I have a directory RealPhotos containing 17,000 JPG photos, and I would like to create a train dataloader and a test dataloader.

ls RealPhotos/
2007_000027.jpg  2008_007119.jpg  2010_001501.jpg  2011_002987.jpg
2007_000032.jpg  2008_007120.jpg  2010_001502.jpg  2011_002988.jpg
2007_000033.jpg  2008_007123.jpg  2010_001503.jpg  2011_002992.jpg
2007_000039.jpg  2008_007124.jpg  2010_001505.jpg  2011_002993.jpg
2007_000042.jpg  2008_007129.jpg  2010_001511.jpg  2011_002994.jpg
2007_000061.jpg  2008_007130.jpg  2010_001514.jpg  2011_002996.jpg
2007_000063.jpg  2008_007131.jpg  2010_001515.jpg  2011_002997.jpg
2007_000068.jpg  2008_007133.jpg  2010_001516.jpg  2011_002999.jpg
2007_000121.jpg  2008_007134.jpg  2010_001518.jpg  2011_003002.jpg
2007_000123.jpg  2008_007138.jpg  2010_001520.jpg  2011_003003.jpg
...

I know I can subclass TensorDataset to make it compatible with unlabeled data:

class UnlabeledTensorDataset(TensorDataset):
    """Dataset wrapping unlabeled data tensors.

    Each sample will be retrieved by indexing tensors along the first
    dimension.

    Arguments:
        data_tensor (Tensor): contains sample data.
    """
    def __init__(self, data_tensor):
        self.data_tensor = data_tensor

    def __getitem__(self, index):
        return self.data_tensor[index]

    def __len__(self):
        # Needed because the parent __len__ relies on attributes this subclass does not set
        return self.data_tensor.size(0)

And something along these lines for training the autoencoder

import numpy.random as rnd
import torch
import torch.utils.data as data_utils

X_train      = rnd.random((300, 100))
train        = UnlabeledTensorDataset(torch.from_numpy(X_train).float())
train_loader = data_utils.DataLoader(train, batch_size=1)

for epoch in range(50):
    for batch in train_loader:        # Variable wrapping is no longer needed in modern PyTorch
        optimizer.zero_grad()
        output = model(batch)
        loss = criterion(output, batch)
        loss.backward()               # backpropagate the reconstruction loss
        optimizer.step()              # update the weights
Alex
  • Do you want `ArtificialPhotos` to be in the train set and `RealPhotos` to be in the test set? – Ivan Dec 01 '20 at 21:18
  • @Ivan Sorry, while you were writing your answer I modified my question. Don't remove what you have just written in your answer, but edit it instead. – Alex Dec 01 '20 at 21:22
  • So basically you want to split your data and create two loaders? – Ivan Dec 01 '20 at 21:24
  • Yes, a portion of the photos will be used to test the autoencoder and the other portion will be used to train it. – Alex Dec 01 '20 at 21:26
  • @Ivan Just out of curiosity, is there anything you need to change in your answer? – Alex Dec 01 '20 at 21:39
  • Yes, but I need to create a dataset and a dataloader. It seems you have deleted important code – Alex Dec 01 '20 at 21:41

1 Answer


You first need to define a Dataset (torch.utils.data.Dataset); then you can use a DataLoader on it. There is no difference between your train and test datasets: you can define a generic dataset that looks into a particular directory and maps each index to a unique file.

class MyDataset(Dataset):
    def __init__(self, directory):
        self.directory = directory
        self.files = os.listdir(directory)

    def __len__(self):
        return len(self.files)

    def __getitem__(self, index):
        # Join with the directory, since os.listdir returns bare filenames
        path = os.path.join(self.directory, self.files[index])
        img = Image.open(path).convert('RGB')
        return T.ToTensor()(img)

Where T refers to torchvision.transforms and Image is imported from PIL.
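For completeness, here is a sketch of the imports the snippet above assumes (standard library plus PyTorch, torchvision and PIL):

import os

from PIL import Image
from torch.utils.data import Dataset, DataLoader, random_split
import torchvision.transforms as T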

You can then instantiate a dataset with

data_set = MyDataset('./RealPhotos')

From there you can use torch.utils.data.random_split to perform the split:

train_len = int(len(data_set)*0.7)      
train_set, test_set = random_split(data_set, [train_len, len(data_set)-train_len])
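If you want the split to be reproducible across runs, random_split also accepts a generator argument in recent PyTorch versions. A minimal sketch (assuming torch is imported):

# Optional: seed the split so train/test membership is the same every run
generator = torch.Generator().manual_seed(42)
train_set, test_set = random_split(
    data_set, [train_len, len(data_set) - train_len], generator=generator
)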

Then use torch.utils.data.DataLoader as you did:

train_loader = DataLoader(train_set, batch_size=1, shuffle=True)
test_loader = DataLoader(test_set, batch_size=16, shuffle=False)
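For the autoencoder use case from the question, here is a minimal sketch of how the two loaders could be used, assuming model, criterion and optimizer are already defined as in the question's training loop. Note that if the photos have different sizes, you would need a T.Resize or T.CenterCrop in the dataset transform before batching with batch_size > 1.

# Minimal sketch: train on train_loader, report reconstruction loss on test_loader.
# Assumes `model`, `criterion` and `optimizer` exist (not shown in the answer).
for epoch in range(50):
    model.train()
    for batch in train_loader:
        optimizer.zero_grad()
        output = model(batch)
        loss = criterion(output, batch)   # reconstruction loss against the input
        loss.backward()
        optimizer.step()

    model.eval()
    test_loss = 0.0
    with torch.no_grad():                 # no gradients needed for evaluation
        for batch in test_loader:
            output = model(batch)
            test_loss += criterion(output, batch).item() * batch.size(0)
    print(f"epoch {epoch}: test loss {test_loss / len(test_set):.4f}")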
Ivan
  • I prefer the first answer you gave me, because it showed a way to access the files in the directory. Combining that answer with this one would be awesome – Alex Dec 01 '20 at 21:48
  • I would have liked to upvote the question, but my score is too low. Thanks for the answer! – Alex Dec 01 '20 at 21:53