I'll give you an example of how to use dataloaders and explain the steps along the way.
A DataLoader is an iterable over a dataset: each time you iterate over it, it returns B samples from the dataset (each consisting of a data sample and its target/label), drawn in random order if shuffling is enabled, where B is the batch size.
To create such a dataloader you first need a class that inherits from PyTorch's Dataset class. PyTorch ships a ready-made implementation called TensorDataset, which wraps tensors that are already in memory.
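For illustration, here is a minimal sketch of TensorDataset (the feature and label tensors are made up for the example):

import torch
from torch.utils.data import TensorDataset

# toy data: 100 samples with 10 features each, plus binary labels
features = torch.randn(100, 10)
labels = torch.randint(0, 2, (100,))

dataset = TensorDataset(features, labels)
x, y = dataset[0]  # one (sample, label) pair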
But the more common approach is to write your own Dataset subclass. Here is an example for image classification:
import os
import numpy as np
import torch
from PIL import Image

class YourImageDataset(torch.utils.data.Dataset):
    def __init__(self, image_folder):
        self.image_folder = image_folder
        self.images = os.listdir(image_folder)

    # return one sample and its label
    def __getitem__(self, idx):
        image_file = self.images[idx]
        image = Image.open(os.path.join(self.image_folder, image_file))
        image = np.array(image)
        # normalize pixel values to [0, 1]
        image = image / 255
        # convert to a float tensor and move channels first: (H, W, C) -> (C, H, W)
        # (here the images are assumed to be RGB, e.g. 512x512x3)
        image = torch.from_numpy(image).float().permute(2, 0, 1)
        # get the label; in this case the label is encoded in the file name,
        # e.g. 1_image_28457.png, where 1 is the label and the trailing number is just an id
        target = int(image_file.split("_")[0])
        target = torch.tensor(target)
        return image, target

    def __len__(self):
        return len(self.images)
To get an example image you can index the dataset directly (which calls __getitem__ under the hood). It returns the image tensor and the label tensor at that index. For example:

dataset = YourImageDataset("/path/to/image/folder")
image, target = dataset[0]  # get the sample at index 0
Alright, so now you have a class that preprocesses and returns ONE sample and its label. Next we create the dataloader, which "wraps" this class and returns whole batches of samples from your dataset.
Let's create three dataloaders: one that iterates over the train set, one for the test set, and one for the validation set:
dataset = YourImageDataset("/path/to/image/folder")

# let's split the dataset into three parts (train 70%, test 15%, validation 15%)
test_size = 0.15
val_size = 0.15
test_amount, val_amount = int(len(dataset) * test_size), int(len(dataset) * val_size)

# this function randomly splits the dataset for you, but you could also implement the split yourself
train_set, val_set, test_set = torch.utils.data.random_split(dataset, [
    len(dataset) - (test_amount + val_amount),
    test_amount,
    val_amount,
])
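As a side note, in recent PyTorch versions (1.13 and later) random_split also accepts fractions directly, and you can pass a seeded generator to make the split reproducible:

train_set, val_set, test_set = torch.utils.data.random_split(
    dataset,
    [0.7, 0.15, 0.15],
    generator=torch.Generator().manual_seed(42),
)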
# B is your batch size, e.g. 128
train_dataloader = torch.utils.data.DataLoader(
    train_set,
    batch_size=B,
    shuffle=True,
)
# shuffling is only really needed for training; for validation and
# testing the order of the samples doesn't affect the metrics
val_dataloader = torch.utils.data.DataLoader(
    val_set,
    batch_size=B,
    shuffle=False,
)
test_dataloader = torch.utils.data.DataLoader(
    test_set,
    batch_size=B,
    shuffle=False,
)
Now you have created your dataloaders and are ready to train!
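Before training, you can pull a single batch to sanity-check the shapes (assuming 512x512 RGB images, as above):

images, targets = next(iter(train_dataloader))
print(images.shape)   # torch.Size([B, 3, 512, 512])
print(targets.shape)  # torch.Size([B])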
For example like this:
for epoch in range(epochs):
    model.train()  # put the model into training mode
    for images, targets in train_dataloader:
        # 'images' is a batch of B samples and 'targets' is a batch of
        # the B corresponding targets (matched by index)
        optimizer.zero_grad()
        images, targets = images.cuda(), targets.cuda()
        predictions = model(images)
        . . .
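To make the example concrete, here is a minimal sketch of how the loop body might continue, assuming a classification setup with CrossEntropyLoss (the criterion is my assumption, not part of the original example):

criterion = torch.nn.CrossEntropyLoss()
for epoch in range(epochs):
    model.train()
    for images, targets in train_dataloader:
        optimizer.zero_grad()
        images, targets = images.cuda(), targets.cuda()
        predictions = model(images)
        loss = criterion(predictions, targets)  # targets must be integer class indices
        loss.backward()   # backpropagate
        optimizer.step()  # update the weights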
Normally you would put the "YourImageDataset" class in its own file and then import it into the file in which you create the dataloaders.
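For example (the file name your_image_dataset.py is just an assumption for illustration):

# in your training script
from your_image_dataset import YourImageDataset

dataset = YourImageDataset("/path/to/image/folder")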
I hope this makes clear what the roles of the DataLoader and the Dataset class are and how to use them!
I don't know much about iterable-style datasets, but from what I understood: the method I showed you above is the map-style. You use that if your dataset is stored in a .csv, .json, or similar file, so you can index into all rows or entries of the dataset. Iterable-style takes your dataset (or a part of it) and converts it into an iterable. For example, if your dataset is a list, this is what an iterator over the list would look like:
dataset = [1, 2, 3, 4]
dataset = iter(dataset)
print(next(dataset))
print(next(dataset))
print(next(dataset))
print(next(dataset))
# output:
# >>> 1
# >>> 2
# >>> 3
# >>> 4
So each call to next gives you the next item of the list. Using this together with a PyTorch DataLoader can be more efficient in some setups. Normally the map-style dataloader is fast enough and is the common choice, but the documentation suggests that when you are loading data batches from a database (where random access can be slow), an iterable-style dataset is more efficient.
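For reference, here is a minimal sketch of an iterable-style dataset using torch.utils.data.IterableDataset (the streaming source is a made-up stand-in for something like a database cursor):

import torch

class YourIterableDataset(torch.utils.data.IterableDataset):
    def __init__(self, source):
        self.source = source  # any iterable, e.g. rows streamed from a database

    def __iter__(self):
        # yield samples one by one; the DataLoader collects them into batches
        for row in self.source:
            yield row

dataset = YourIterableDataset(range(10))
dataloader = torch.utils.data.DataLoader(dataset, batch_size=4)
for batch in dataloader:
    print(batch)
# output:
# >>> tensor([0, 1, 2, 3])
# >>> tensor([4, 5, 6, 7])
# >>> tensor([8, 9])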
This explanation of iterable-style datasets is a bit vague, but I hope it conveys what I understood. I would recommend starting with the map-style, as I explained in my original answer.