Data augmentation on-the-fly for semantic segmentation, Is my python layer definition correct?

Question

I am not an expert in Caffe or Python, but I am trying to learn step by step. I am a little confused, so I would really appreciate it if experts could take a look at my questions.

I am working on image segmentation and trying to do on-the-fly data augmentation by adding Python layers. For my dataset I would like to apply translations of +10 and -10 pixels along both the x-axis and the y-axis (4 additional translations), add Gaussian noise, and flip horizontally.

My questions are:

How does Caffe synchronize the image with its label? For example, the image is sent into the network by the data layer while, on the side, the label is sent to SoftmaxWithLoss. I have drawn (by hand) a schematic view of the augmentation and of the normal flow of data, and I am not sure how correct my understanding is.

As can be seen in the figure, for translation we have to translate the image and the ground truth in a synchronized manner (and for flipping, we have to flip the label as well). For example, if I shift the image by -10 pixels along both the x-axis and the y-axis, the ground-truth image also needs to be relocated correspondingly. How can this be done in a Caffe Python layer? Is my understanding (based on the figure) correct? I have written the Python layer as follows:

import caffe
import numpy as np
from skimage import transform as tf
from skimage.transform import AffineTransform

class ShiftLayer(caffe.Layer):

    def setup(self, bottom, top):
        # Requires two inputs: bottom[0] is the image, bottom[1] is the label
        assert len(bottom) == 2
        # Requires two outputs
        assert len(top) == 2

    def reshape(self, bottom, top):
        # HOW CAN WE KNOW whether LABEL or DATA is going to "bottom[0]" or "bottom[1]"?
        top[0].reshape(*bottom[0].data.shape)
        top[1].reshape(*bottom[1].data.shape)

    def forward(self, bottom, top):
        # Fixed shift applied identically to the image and the label
        x_trans = -10
        y_trans = -10
        top[0].data[...] = tf.warp(bottom[0].data, AffineTransform(translation=(x_trans, y_trans)))
        top[1].data[...] = tf.warp(bottom[1].data, AffineTransform(translation=(x_trans, y_trans)))

    def backward(self, top, propagate_down, bottom):
        pass

And this is the layer definition:

layer {
  name: "shift_layer"
  type: "Python"
  bottom: "data"
  bottom: "label"
  top: "data"
  top: "label"
  include {
    phase: TRAIN
  }
  python_param {
    module: "myshift_layer"
    layer: "ShiftLayer"
  }
}

If I add other augmentation techniques to the network, should I write a separate module for each of them, or can I write one single Python layer with many bottoms and the corresponding tops? If so, how can I know which top is related to which bottom?
In the case of Gaussian noise addition, the label stays the same as for the input image. What does the layer definition look like for this one?

score 1 · Accepted Answer · answered Dec 18 '17 at 09:36

In general, your understanding looks correct. But:

Caffe blobs (top, bottom) store images in (channels * rows * columns) form, unlike the usual (rows * columns * channels) form. This makes no difference for a 1-channel image (such as the labels), but it does for color images. I doubt that tf.warp works correctly in this case.
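One way around the layout mismatch is to reorder the axes of each image before handing it to an image library and to restore the order afterwards. The snippet below is only a sketch of that idea using NumPy: the helper names chw_to_hwc and hwc_to_chw are made up for illustration, and the array blob merely stands in for bottom[0].data, assuming the usual 4-D blob shape (batch, channels, rows, columns).

```python
import numpy as np


def chw_to_hwc(image_chw):
    """Reorder one Caffe-style (channels, rows, cols) image to (rows, cols, channels)."""
    return np.transpose(image_chw, (1, 2, 0))


def hwc_to_chw(image_hwc):
    """Reorder a (rows, cols, channels) image back to (channels, rows, cols)."""
    return np.transpose(image_hwc, (2, 0, 1))


# Stand-in for bottom[0].data: a batch of 2 three-channel images of 4 rows x 5 columns
blob = np.arange(2 * 3 * 4 * 5, dtype=np.float32).reshape(2, 3, 4, 5)

img_hwc = chw_to_hwc(blob[0])      # shape (4, 5, 3), the layout image libraries usually expect
# ... a geometric transform would be applied to img_hwc here ...
restored = hwc_to_chw(img_hwc)     # back to (3, 4, 5) before writing into top[0].data[0]
```

Because the blob holds a whole batch, the conversion would be applied per image (blob[i]) rather than to the 4-D array at once.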
I see no reason to make separate layers for each kind of augmentation (shift, flip, etc.). There is no problem doing all of them in one Python layer. But I don't understand your idea of having many bottoms and tops in this case. Moreover, the Python layer that you have shown actually performs no augmentation, because it simply produces a set of identically shifted images in place of the original ones. That will not improve the training process.

The commonly used approach to on-the-fly augmentation is a transformation that does not influence the net shape, but puts randomly(!) transformed data in place of the original data. So when the net processes the same input image at different training epochs, it actually handles different images, produced from this input image by a random transformation. You therefore have to complete your example with a random choice of x_trans and y_trans. In the general case you can also add a random flip, random Gaussian noise, etc. These transformations can be applied simultaneously, or you can randomly choose one of them. Either way, the layer must have only one data+label pair as bottoms and one as tops.
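The following is a minimal sketch of such a randomized layer, not a tested drop-in replacement. Several things in it are assumptions made for illustration: the names shift2d, augment_pair and RandomAugmentLayer, whole-pixel shifts implemented with NumPy slicing instead of tf.warp, a noise sigma of 2.0, and filling the uncovered label border with 255 (which only helps if the loss layer is configured to ignore that value). The helpers act on the last two axes, so the channels-first blob layout does not matter. bottom[0] and bottom[1] follow the order of the bottom: lines in the prototxt, so data comes first and label second.

```python
import numpy as np

try:                                   # Caffe is only needed for the layer class itself
    import caffe
    _LayerBase = caffe.Layer
except ImportError:                    # lets the helper functions run without Caffe
    _LayerBase = object


def shift2d(arr, dx, dy, fill):
    """Shift the last two axes (rows, cols) by (dy, dx) whole pixels, padding with fill."""
    h, w = arr.shape[-2:]
    out = np.full_like(arr, fill)
    if abs(dx) >= w or abs(dy) >= h:
        return out                     # content shifted completely out of view
    src_y, dst_y = slice(max(0, -dy), min(h, h - dy)), slice(max(0, dy), min(h, h + dy))
    src_x, dst_x = slice(max(0, -dx), min(w, w - dx)), slice(max(0, dx), min(w, w + dx))
    out[..., dst_y, dst_x] = arr[..., src_y, src_x]
    return out


def augment_pair(data, label, rng, max_shift=10, noise_sigma=2.0, ignore_label=255):
    """Randomly shift and flip a (C, H, W) image together with its (1, H, W) label."""
    dx, dy = (int(v) for v in rng.integers(-max_shift, max_shift + 1, size=2))
    data = shift2d(data, dx, dy, fill=0)
    label = shift2d(label, dx, dy, fill=ignore_label)   # same offsets for both
    if rng.random() < 0.5:             # horizontal flip: reverse the column axis of both
        data, label = data[..., ::-1], label[..., ::-1]
    if noise_sigma > 0:                # Gaussian noise changes pixel values, never labels
        data = data + rng.normal(0.0, noise_sigma, size=data.shape)
    return data, label


class RandomAugmentLayer(_LayerBase):
    """Hypothetical layer with one (data, label) pair as bottoms and one as tops."""

    def setup(self, bottom, top):
        assert len(bottom) == 2 and len(top) == 2
        self.rng = np.random.default_rng()

    def reshape(self, bottom, top):
        top[0].reshape(*bottom[0].data.shape)
        top[1].reshape(*bottom[1].data.shape)

    def forward(self, bottom, top):
        for i in range(bottom[0].data.shape[0]):        # fresh random draw per batch item
            data, label = augment_pair(bottom[0].data[i], bottom[1].data[i], self.rng)
            top[0].data[i, ...] = data
            top[1].data[i, ...] = label

    def backward(self, top, propagate_down, bottom):
        pass                           # augmentation has no gradient to propagate
```

Since the noise branch touches only the image, the Gaussian-noise case needs no separate layer definition: the label simply passes through with whatever shift and flip were drawn for the image.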

Also, applying these transforms may blur the edges of the labels. With multiclass labels this can become a problem. - Andrey Smorodov, Dec 21 '17 at 10:52
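The blurring comes from interpolation: resampling a label map with anything other than nearest-neighbour sampling mixes adjacent class ids into values that are not classes at all. The toy example below illustrates this on a single row of labels with NumPy only; the class ids 0, 3 and 7 and the 0.6-pixel offset are arbitrary choices for the illustration. With skimage's warp, nearest-neighbour sampling is selected by passing order=0, and preserve_range=True should keep integer label values from being rescaled.

```python
import numpy as np

labels = np.array([0, 0, 3, 3, 7, 7], dtype=np.float32)   # class ids along one image row
grid = np.arange(labels.size)
positions = grid + 0.6                                     # sample 0.6 pixels to the right

# Linear interpolation blends neighbouring class ids at the class boundaries.
linear = np.interp(positions, grid, labels)

# Nearest-neighbour sampling only ever copies an existing class id.
nearest_idx = np.clip(np.rint(positions).astype(int), 0, labels.size - 1)
nearest = labels[nearest_idx]

print("linear :", linear)    # contains values between the class ids
print("nearest:", nearest)   # contains only 0, 3 and 7
```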
