How to add data augmentation to an object detection model (DETR) HuggingFace

Question

I am trying to follow the Hugging Face DETR tutorial to fine-tune the model on my own dataset. The tutorial includes the following note about data augmentation:

Note regarding data augmentation

DETR actually uses several image augmentations during training. One of them is scale augmentation: they set the min_size randomly to be one of [480, 512, 544, 576, 608, 640, 672, 704, 736, 768, 800] as can be seen here. 
However, we are not going to add any of the augmentations that are used in the original implementation during training. It works fine without them.
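
To make the scale augmentation in the note concrete, here is a minimal sketch of the size computation in plain Python: the shorter side is resized to a randomly chosen value and the longer side is capped at max_size (assumed to be 1333, matching the preprocessor config shown in this question). The names DETR_SCALES, target_size and random_scale_size are made up for this sketch; they are not part of transformers or of the original DETR code, and the rounding may differ slightly from what either library does.

```python
import random

# Candidate shorter-side sizes quoted in the tutorial note.
DETR_SCALES = [480, 512, 544, 576, 608, 640, 672, 704, 736, 768, 800]


def target_size(width, height, min_size, max_size=1333):
    """Return (new_width, new_height) so that the shorter side equals
    min_size, unless that would push the longer side past max_size."""
    short, long = min(width, height), max(width, height)
    if long / short * min_size > max_size:
        # Shrink the requested shorter side so the longer side fits.
        min_size = int(round(max_size * short / long))
    if width < height:
        return min_size, int(min_size * height / width)
    return int(min_size * width / height), min_size


def random_scale_size(width, height, rng=random):
    """Pick one of the DETR scales at random and compute the output size."""
    return target_size(width, height, rng.choice(DETR_SCALES))


if __name__ == "__main__":
    print(target_size(640, 480, 800))
    print(random_scale_size(640, 480))
```

How the chosen size is then passed to the image processor depends on the transformers version in use, so that step is left out here.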

However, I want to add some augmentations of my own, such as zoom and HSV variations. The dataset class is defined as follows:

import torchvision
import os

class CocoDetection(torchvision.datasets.CocoDetection):
    def __init__(self, img_folder, processor, train=True):
        ann_file = os.path.join(img_folder, "custom_train.json" if train else "custom_val.json")
        super(CocoDetection, self).__init__(img_folder, ann_file)
        self.processor = processor

    def __getitem__(self, idx):
        # read in PIL image and target in COCO format
        img, target = super(CocoDetection, self).__getitem__(idx)
        
        # preprocess image and target (converting target to DETR format, resizing + normalization of both image and target)
        image_id = self.ids[idx]
        target = {'image_id': image_id, 'annotations': target}
        encoding = self.processor(images=img, annotations=target, return_tensors="pt")
        pixel_values = encoding["pixel_values"].squeeze() # remove batch dimension
        target = encoding["labels"][0] # remove batch dimension

        return pixel_values, target

The preprocessor_config.json that Hugging Face loads with the model contains:

{
  "do_normalize": true,
  "do_resize": true,
  "feature_extractor_type": "DetrFeatureExtractor",
  "format": "coco_detection",
  "image_mean": [
    0.485,
    0.456,
    0.406
  ],
  "image_std": [
    0.229,
    0.224,
    0.225
  ],
  "max_size": 1333,
  "size": 800
}

This config already includes the mean and standard deviation used for normalization. As a new Hugging Face user, I do not know whether repeating the normalization in the usual torchvision.transforms.Compose([]) would affect anything, or whether I should leave it out because the JSON file already handles it. I could not find anything similar online. It is probably a basic question, but I do not know how or where to add the augmentations. Could anyone give an example for my situation? Any help is appreciated. Thank you so much.
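
One possible place for augmentations is inside __getitem__, after the PIL image and COCO annotations are read and before they are handed to the processor. Since the config has "do_normalize": true, the processor already applies the mean/std normalization, so a separate Normalize step should not be needed. Pixel-only changes (such as HSV jitter) leave the boxes untouched, while geometric changes (flip, zoom, crop) must also update each "bbox". The sketch below shows only the box side of a horizontal flip, assuming COCO-style [x, y, width, height] boxes in pixels; hflip_coco_annotations is a hypothetical helper, not a Hugging Face or torchvision function, and it ignores segmentation masks.

```python
def hflip_coco_annotations(annotations, image_width):
    """Return copies of COCO annotations mirrored left to right.

    Each bbox is [x, y, w, h] in pixels; only x changes under a
    horizontal flip. Segmentation masks are NOT handled here.
    """
    flipped = []
    for ann in annotations:
        x, y, w, h = ann["bbox"]
        new_ann = dict(ann)  # shallow copy so the input is not modified
        new_ann["bbox"] = [image_width - x - w, y, w, h]
        flipped.append(new_ann)
    return flipped


# Hypothetical use inside CocoDetection.__getitem__, before self.processor(...).
# self.train is an assumed attribute that the class in the question does not store:
#
#     if self.train and random.random() < 0.5:
#         target = hflip_coco_annotations(target, img.width)
#         img = PIL.ImageOps.mirror(img)  # flips the PIL image left to right

if __name__ == "__main__":
    anns = [{"bbox": [10, 20, 30, 40], "category_id": 1}]
    print(hflip_coco_annotations(anns, image_width=100))
```

The same pattern would apply to zoom or crop: transform the image, recompute every box in the same coordinate frame, then let the processor do the resizing and normalization.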
