2

I understand that R-CNN needs selective search as an external algorithm for generating region-of-interest proposals, but Fast R-CNN can simply take in the entire image, pass it through the convolutional network to create a feature map, and then use a single level of SPP (the RoI pooling layer).
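The RoI pooling step described above can be pictured with a short sketch. This is a minimal, illustrative NumPy version (the `roi_pool` helper is hypothetical; the real layer runs inside the network on GPU): the RoI is cropped from the shared feature map, split into a fixed grid, and max-pooled per cell, so any RoI size yields the same output shape.

```python
import numpy as np

def roi_pool(feature_map, roi, output_size=(2, 2)):
    """Minimal single-level RoI pooling sketch.

    feature_map: (H, W, C) array from the backbone CNN.
    roi: (x1, y1, x2, y2) box in feature-map coordinates.
    output_size: fixed (h, w) grid, independent of the RoI size.
    """
    x1, y1, x2, y2 = roi
    region = feature_map[y1:y2, x1:x2, :]
    h, w = output_size
    # Split the region's rows/cols into an h x w grid of cells.
    rows = np.array_split(np.arange(region.shape[0]), h)
    cols = np.array_split(np.arange(region.shape[1]), w)
    out = np.zeros((h, w, feature_map.shape[2]), dtype=feature_map.dtype)
    for i, r in enumerate(rows):
        for j, c in enumerate(cols):
            # Max-pool each grid cell over its spatial extent.
            out[i, j, :] = region[np.ix_(r, c)].max(axis=(0, 1))
    return out

fm = np.random.rand(8, 8, 3)           # toy feature map
pooled = roi_pool(fm, (1, 1, 7, 6), output_size=(2, 2))
print(pooled.shape)                    # (2, 2, 3) regardless of RoI size
```

Because the output grid is fixed, proposals of any aspect ratio produce a fixed-length vector for the fully connected layers, which is exactly the role the RoI pooling layer plays in Fast R-CNN.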

On the other hand, SPP-net uses multi-level SPP.

In slow R-CNN, SPP-net, and Fast R-CNN, the regions of interest (RoIs) come from a proposal method ("selective search", ??, ?? respectively).

Could anyone explain in detail, with citations, which proposal methods are explicitly used in SPP-net and Fast R-CNN? I didn't find it clearly stated in the research papers.

Anu

1 Answer

2

The official GitHub repos show that both SPP-net and Fast R-CNN used the same region proposal method as R-CNN, namely selective search:

In the SPP_net repo there is a selective search module for computing region proposals, and in the fast-rcnn repo the author specifically states that the method for computing object proposals is selective search.

But again, region proposals can also be generated by other methods, since R-CNN and Fast R-CNN treat the object proposal method as an external module, independent of the detector.

Generally speaking, a method that generates more proposals can benefit the final detection accuracy, but this of course limits the detection speed. Section 2 ("Related Work") of the Faster R-CNN paper gives a nice summary of object proposal generation methods.

For the follow-up question, namely how to intuitively picture region proposals on the feature map, the following picture (ref) illustrates it well:

In the picture, the red box on the left becomes, after the convolution operation, the red square in the output volume on the right; the green box corresponds to the green square, and so on. Now imagine the whole 7x7 grid on the left is a region proposal: on the output feature map it is still a region proposal! Of course, in reality the image on the left has many more pixels, so there can be many region proposals, and each of them will still look like a region proposal on the output feature map.
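Concretely, a box in image coordinates can be carried onto the feature map simply by dividing by the backbone's total stride (16 here, the conv5 stride of VGG-16, used as an assumption). The `project_box` helper below is a hypothetical sketch using plain floor/ceil rounding:

```python
def project_box(box, total_stride=16):
    """Map an image-space box (x1, y1, x2, y2) onto the feature map
    by dividing by the cumulative stride of the conv/pool layers.

    Floor the top-left corner and ceil the bottom-right one so the
    projected window fully covers the original box.
    """
    x1, y1, x2, y2 = box
    return (x1 // total_stride,
            y1 // total_stride,
            -(-x2 // total_stride),   # ceiling division
            -(-y2 // total_stride))

print(project_box((33, 64, 223, 191)))  # (2, 4, 14, 12)
```

This is why selective search only ever needs the raw image: the boxes it emits are translated to feature-map coordinates afterwards, rather than the algorithm being rerun on the activations.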

Finally, the original SPP_net paper explains exactly how the authors transform region proposals from the original image into candidate windows on the feature map.
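As I recall it, the corner mapping in the SPP-net paper's appendix (under its padding convention of roughly half the kernel size per layer) rounds the left/top corner down and the right/bottom corner up, with +1/-1 offsets so the window lies inside the projected region. A sketch of that rule, treating the exact offsets as the paper's convention rather than a universal one:

```python
import math

def sppnet_corner_map(box, S=16):
    """SPP-net-style mapping of an image-space box to a candidate
    window on the conv feature map (S = total stride).

    Left/top corner:   x' = floor(x / S) + 1
    Right/bottom one:  x' = ceil(x / S) - 1
    """
    x1, y1, x2, y2 = box
    return (math.floor(x1 / S) + 1,
            math.floor(y1 / S) + 1,
            math.ceil(x2 / S) - 1,
            math.ceil(y2 / S) - 1)

print(sppnet_corner_map((33, 64, 223, 191)))  # (3, 5, 13, 11)
```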

Danny Fang
  • thanks, one follow-up question: I can intuitively understand R-CNN creating RoIs first on raw pixels (the image) and then applying a CNN to get feature maps. What I haven't understood in SPP-net & Fast R-CNN is how we get the same RoIs as before by applying selective search on feature maps. How would the same region proposal algorithm (or even different methods) applied to the raw image or to the feature maps yield the same representations, giving the same or better detection accuracy? Could you please elaborate on that in your post too. – Anu Apr 03 '19 at 17:23
  • thanks for adding an answer, I was able to trace the selective search modules in both the fast R-CNN & SPP-net implementations, but I am still unable to fully digest: `How could selective search on raw pixels (slow R-CNN) and selective search on feature maps result in the same/similar representations?` In SPP-net, classification happens for each object proposal using a feature vector extracted from the shared feature map, whereas in R-CNN there is no shared computation since the feature vectors are extracted per region proposal by running the CNN separately. Any suggestion? – Anu Apr 04 '19 at 11:03
  • 2
    The selective search on SPP-net also happens on the original raw pixels but the **region proposals will transform to the corresponding ones on the feature maps**, and that is why I added the part to explain why region proposals on the raw image will still look like region proposals on the feature map, and this also explains why selective search can be used as an external module because it only depends on the raw input image. – Danny Fang Apr 04 '19 at 11:07
  • 1
    Now to get a fixed representation, each proposal on the feature map will then pass a pooling layer to get the final representation, usually this representation is fixed length no matter what size the proposal is. – Danny Fang Apr 04 '19 at 11:17
  • that's true, and you explained it in the post via the max-pool operation. Where I am confused is how the same SS algorithm is `able to output the same object proposals` on `the original image (3-channel, R-CNN) as on the feature maps (n-channel & different aspect ratio, SPP-net or Fast R-CNN)`. Everything after that point is quite clear to me! In other words, in R-CNN I can visualize the region proposals as warped image sections which we can see & digest; what do they look like in SPP-net or Fast R-CNN, since there they are simply some activations? – Anu Apr 04 '19 at 11:32
  • 2
    In all three cases, the SS algorithm will all be applied on the raw images! There is no difference! **The SS algorithm won't be applied to the feature maps**. Only the bounding boxes on the raw image will transform to the ones on the feature map (n-channel & different aspect ratio.) – Danny Fang Apr 04 '19 at 11:39
  • 1
    Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/191234/discussion-between-danyfang-and-anu). – Danny Fang Apr 04 '19 at 11:42