Performing data augmentation for a classification task is easy, as most transforms do not change the ground-truth label of the image.
However, in the case of object localization, cropping raises several issues:
- The position of the bounding box is relative to the crop that has been taken.
- The bounding box may lie only partially inside the crop window. Do we clip the box to the crop boundary in this case?
- The object's bounding box may fall entirely outside the crop. Do we discard such examples during training?
I am unable to understand how such cases are handled in object localization. Most papers suggest the use of multi-scale training but don't address these issues.
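To make the question concrete, here is a minimal sketch of one way the three cases could be handled. This is only my guess at a plausible convention, not something taken from a particular paper. The function name `crop_box`, the `(x_min, y_min, x_max, y_max)` box format, and the `min_visibility` threshold of 0.3 are all my own assumptions.

```python
from typing import Optional, Tuple

# Assumed box format: (x_min, y_min, x_max, y_max) in pixel coordinates.
Box = Tuple[float, float, float, float]


def crop_box(box: Box, crop: Box, min_visibility: float = 0.3) -> Optional[Box]:
    """Map a bounding box into the coordinate frame of a crop window.

    Returns the clipped box relative to the crop's top-left corner, or None
    if the box should be dropped. `min_visibility` is a hypothetical
    threshold on the fraction of the original box area left inside the crop.
    """
    bx0, by0, bx1, by1 = box
    cx0, cy0, cx1, cy1 = crop

    # Clip the box to the crop window (handles partial overlap).
    ix0, iy0 = max(bx0, cx0), max(by0, cy0)
    ix1, iy1 = min(bx1, cx1), min(by1, cy1)

    # No overlap at all: the object is not in the crop.
    if ix1 <= ix0 or iy1 <= iy0:
        return None

    # Drop boxes where too little of the object remains visible.
    box_area = (bx1 - bx0) * (by1 - by0)
    visible = (ix1 - ix0) * (iy1 - iy0) / box_area
    if visible < min_visibility:
        return None

    # Shift into crop-relative coordinates.
    return (ix0 - cx0, iy0 - cy0, ix1 - cx0, iy1 - cy0)
```

With this sketch, a box fully inside the crop is only translated, a partially covered box is clipped and kept if enough of it remains, and a box outside the crop returns `None`. Whether a crop with no remaining boxes should be discarded or kept as a background-only example is exactly the part I am unsure about.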