I was going through the slides for Mask R-CNN given here, but wasn't able to compute the feature map after applying RoIAlign, as shown in the image below. The paper and slides say to use bilinear interpolation, but I can't figure out how to do that for the given image. Thanks.
2 Answers
Once you have placed the 4 dots inside each pooling cell, the value of each dot is determined by bilinear interpolation from the 4 pixels closest to it. Once you have a value for each dot, you take either the average or the max of the 4 dots in each pooling cell. You put that value into the corresponding spot in the output tensor and you are good to go for the forward operation; the backward operation should not be a problem either.
For instance, in your image the first red dot is surrounded by the pixels with values 0.85, 0.34, 0.32 and 0.74. The resulting value is a function of:
- these values
- the distances from the red dot to these pixels (their centers)
The closer the dot is to a pixel, the closer its value is to that pixel's value.
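A minimal sketch of that weighting in Python. It assumes the four values sit at the top-left, top-right, bottom-left and bottom-right of the dot, in that order (the image is not reproduced here, so this layout is a guess), and that `dx` and `dy` are the dot's fractional offsets from the top-left pixel center. The function name `bilinear` is only illustrative.

```python
def bilinear(top_left, top_right, bottom_left, bottom_right, dx, dy):
    """Interpolate between four pixel values.

    dx, dy: offsets of the dot from the top-left pixel center, in [0, 1],
    measured in units of one pixel spacing (dx to the right, dy downward).
    """
    # Interpolate horizontally along the top and bottom pixel pairs...
    top = (1 - dx) * top_left + dx * top_right
    bottom = (1 - dx) * bottom_left + dx * bottom_right
    # ...then vertically between the two intermediate results.
    return (1 - dy) * top + dy * bottom


# Assumed layout of the four values quoted above.
# A dot equidistant from all four centers gets their plain average.
print(bilinear(0.85, 0.34, 0.32, 0.74, dx=0.5, dy=0.5))
# A dot close to the 0.85 pixel gets a value close to 0.85.
print(bilinear(0.85, 0.34, 0.32, 0.74, dx=0.1, dy=0.1))
```

Each pixel's effective weight is the product of its horizontal and vertical closeness to the dot, so the four weights always sum to 1.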

- While considering the second red dot, which lies in the 0.76 pixel, what will the surrounding pixels be? Every pixel is surrounded by 8 pixels; which ones do we choose? – Prakash Vanapalli Jul 19 '18 at 09:20
- Wiki says "bilinear interpolation uses values of only the 4 nearest pixels, located in diagonal directions from a given pixel, in order to find the appropriate color intensity values of that pixel." So I guess it is not all the surrounding pixels (of which there can be 8) but only the diagonal ones. – Prakash Vanapalli Jul 19 '18 at 09:39
- It really depends on your implementation; I guess it is a design choice. It seems Girshick went for the neighbors defined by the floor and ceil operators applied to the float coordinates of the dot. – jeandut Jul 19 '18 at 10:55
- Which defines exactly 4 pixels. – jeandut Jul 19 '18 at 10:56
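To make the floor/ceil rule from the comments concrete, here is a small Python sketch that lists the 4 pixels picked for a dot at float coordinates `(y, x)` together with their interpolation weights. It assumes integer coordinates mark pixel centers and that the dot lies inside the feature map; `four_neighbors` is a hypothetical helper, not taken from any Mask R-CNN code base.

```python
import math


def four_neighbors(y, x):
    """Return [((row, col), weight), ...] for the pixels around (y, x).

    Assumes integer coordinates are pixel centers and (y, x) is in bounds.
    """
    y0, x0 = math.floor(y), math.floor(x)  # pixel above / to the left
    y1, x1 = math.ceil(y), math.ceil(x)    # pixel below / to the right
    dy, dx = y - y0, x - x0                # fractional offsets in [0, 1)
    return [
        ((y0, x0), (1 - dy) * (1 - dx)),
        ((y0, x1), (1 - dy) * dx),
        ((y1, x0), dy * (1 - dx)),
        ((y1, x1), dy * dx),
    ]


for (row, col), weight in four_neighbors(1.25, 2.75):
    print((row, col), weight)
```

When a coordinate is exactly an integer, floor and ceil coincide and the duplicated entries carry zero weight, so the result is still well defined.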
Also check this implementation:
# From Mask R-CNN paper: "We sample four regular locations, so
# that we can evaluate either max or average pooling. In fact,
# interpolating only a single value at each bin center (without
# pooling) is nearly as effective."
#
# Here we use the simplified approach of a single value per bin,
# which is how it's done in tf.image.crop_and_resize()
# Result: [batch * num_boxes, pool_height, pool_width, channels]
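The difference between the paper's four samples per bin and the simplified single sample can be sketched in plain Python. This is an illustration under stated assumptions, not the code of the implementation quoted above: the feature map is a single-channel 2-D list, integer coordinates are pixel centers, the RoI is given as `(y_start, x_start, y_end, x_end)` in that same frame, and `bilinear_sample` and `roi_align` are hypothetical names.

```python
import math


def bilinear_sample(fmap, y, x):
    """Bilinear lookup in a 2-D list; integer coordinates are pixel centers."""
    h, w = len(fmap), len(fmap[0])
    y = min(max(y, 0.0), h - 1.0)  # clamp so every neighbor exists
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = math.floor(y), math.floor(x)
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    top = (1 - dx) * fmap[y0][x0] + dx * fmap[y0][x1]
    bottom = (1 - dx) * fmap[y1][x0] + dx * fmap[y1][x1]
    return (1 - dy) * top + dy * bottom


def roi_align(fmap, roi, out_h, out_w, samples=2, use_max=False):
    """Pool one RoI into an out_h x out_w grid.

    samples=2 places 2 x 2 = 4 regular sample points in each bin;
    samples=1 takes a single value at each bin center.
    """
    y_start, x_start, y_end, x_end = roi
    bin_h = (y_end - y_start) / out_h
    bin_w = (x_end - x_start) / out_w
    out = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Sample points are spread evenly inside bin (i, j).
            vals = [
                bilinear_sample(
                    fmap,
                    y_start + (i + (a + 0.5) / samples) * bin_h,
                    x_start + (j + (b + 0.5) / samples) * bin_w,
                )
                for a in range(samples)
                for b in range(samples)
            ]
            row.append(max(vals) if use_max else sum(vals) / len(vals))
        out.append(row)
    return out


# Toy feature map whose value equals the column index.
ramp = [[float(col) for col in range(4)] for _ in range(4)]
print(roi_align(ramp, (0.0, 0.0, 2.0, 2.0), 2, 2, samples=2))
print(roi_align(ramp, (0.0, 0.0, 2.0, 2.0), 2, 2, samples=1))
```

On a linear ramp like this one, averaging four samples and taking the single center sample give the same result; on real feature maps the two differ slightly, which fits the paper's remark that the single-sample variant is nearly as effective.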
