
I have an Image Captioning dataset, where each sample is composed of an image and a list of captions.

  • Each sample has one or more captions.
  • The number of captions can differ between samples.

Here's a visual example:

[image: visual example of one image paired with several captions]

I am using PyTorch, and I have created a custom Dataset and DataLoader to train models and run evaluation.

  • For training, I randomly choose a caption from the list of available captions, compute the model output, and calculate the NLL loss between the model's output and the target (a minimal sketch of this step follows this list).
  • For evaluation, I want to calculate the loss between the model's output and a sampled caption, but also other metrics used in text-generation tasks, such as BLEU and ROUGE. These metrics accept multiple references, so I want to pass the list of all available captions for each sample.
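
For context, my training step looks roughly like this (a minimal sketch; the model interface and key names are placeholders, not my exact code):

import torch.nn.functional as F

def training_step(model, batch):
    # the model is assumed to return per-token log-probabilities
    log_probs = model(batch['pixel_values'], batch['input_ids'])
    # NLL loss between the output and the single sampled caption
    loss = F.nll_loss(
        log_probs.view(-1, log_probs.size(-1)),
        batch['input_ids'].view(-1),
    )
    return loss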

What is the best way to make the Dataset and DataLoader handle both of these cases, i.e. provide one randomly chosen label for training and all the labels for multi-reference metrics?

  • I tried adding a flag to the Dataset class, set to true during validation and testing. However, since each sample has a varying number of labels, the default DataLoader cannot build the batch. One solution could be to iterate directly over the Dataset, but I think there must be a better one (a collate_fn sketch of another idea follows my code below).
import json
import random
import pathlib as pl

from PIL import Image
from torch.utils.data import Dataset
from torchvision import transforms
from transformers import AutoProcessor


class MyDataset(Dataset):
    def __init__(self, root, split, image_transform, processor):
        file = pl.Path(root) / '{}.json'.format(split)
        with open(file) as f:
            j = json.load(f)
            self.data = list(j.values())
        self.split = split
        self.image_transform = image_transform
        self.processor = processor

    def __len__(self):
        # required for map-style datasets used with a DataLoader
        return len(self.data)

    def __getitem__(self, i):
        image_path = self.data[i]['img_url']
        image = Image.open(image_path).convert('RGB')
        # randomly sample one visual sentence
        labels = self.data[i]['visual_sentences']
        if self.image_transform is not None:
            image = self.image_transform(image)
        encoding = self.processor(images=image, text=random.sample(labels, 1), padding="max_length", return_tensors="pt")
        # remove batch dimension
        encoding = {k: v.squeeze() for k, v in encoding.items()}
        # add all the labels if not in training
        if self.split != 'train':
            encoding['labels'] = labels
        return encoding


# BaseDataLoader comes from my project template
class MyDataLoader(BaseDataLoader):
    def __init__(self, data_dir, batch_size, split, shuffle=True, validation_split=0.0, num_workers=1, processor=None):
        transform = transforms.Compose([
            transforms.Resize((224, 224))
        ])
        processor = AutoProcessor.from_pretrained(processor)
        self.data_dir = data_dir
        self.dataset = MyDataset(data_dir, split, image_transform=transform, processor=processor)
        super().__init__(self.dataset, batch_size, shuffle, validation_split, num_workers)
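
One workaround I am considering, instead of iterating over the Dataset directly, is passing a custom collate_fn to the DataLoader that stacks the tensor fields but leaves the variable-length 'labels' lists as plain Python lists. A minimal sketch (eval_collate_fn is a name I made up; the keys match my Dataset above):

import torch
from torch.utils.data import DataLoader

def eval_collate_fn(batch):
    # stack tensor fields as the default collate would,
    # but keep the variable-length caption lists as-is
    collated = {}
    for key in batch[0]:
        values = [sample[key] for sample in batch]
        if torch.is_tensor(values[0]):
            collated[key] = torch.stack(values)
        else:
            collated[key] = values  # list of caption lists, one per sample
    return collated

# usage (assuming `dataset` is a MyDataset with split != 'train')
eval_loader = DataLoader(dataset, batch_size=8, collate_fn=eval_collate_fn)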

  • Do I have to manually "pad" the lists, e.g. by adding empty strings so that all lists have equal length (and then remove those empty strings while calculating the metrics, I suppose)? Are there other solutions?
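
For what it's worth, the multi-reference metrics themselves do not seem to need padded lists; for example, NLTK's corpus_bleu accepts a different number of references per hypothesis. A minimal sketch with made-up captions:

from nltk.translate.bleu_score import corpus_bleu

# made-up tokenized captions: sample 1 has two references, sample 2 has three
references = [
    [['a', 'dog', 'runs'], ['a', 'dog', 'is', 'running']],
    [['two', 'cats', 'sleep'], ['cats', 'are', 'sleeping'], ['two', 'cats', 'nap']],
]
hypotheses = [['a', 'dog', 'runs'], ['two', 'cats', 'sleep']]

# no empty-string padding needed: each inner list can have its own length
score = corpus_bleu(references, hypotheses)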