
We know that object detection frameworks like Faster R-CNN and Mask R-CNN have an RoI pooling layer or an RoI align layer. Why do the SSD and YOLO frameworks have no such layers?

tidy

1 Answer


First of all, we should understand the purpose of RoI pooling: to obtain a fixed-size feature representation from proposal regions on the feature maps. Because the proposed regions come in various sizes, the features cropped directly from those regions have different shapes and therefore cannot be fed to fully-connected layers for prediction (as we know, fully-connected layers require fixed-shape inputs). For further reading, here is a nice answer.

So we understand that RoI pooling essentially requires two inputs: the proposed regions and the feature maps.
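As a minimal sketch of this idea (using `torchvision.ops.roi_pool`; the feature-map shape and box coordinates are made up for illustration), boxes of different sizes all come out as the same fixed-size feature, which can then be flattened and fed to fully-connected layers:

```python
# Minimal sketch: RoI pooling takes (1) a feature map and (2) proposal boxes
# of arbitrary sizes, and returns one fixed-size feature per box.
import torch
from torchvision.ops import roi_pool

feature_map = torch.randn(1, 256, 50, 50)             # (batch, channels, H, W) -- illustrative shape
proposals = torch.tensor([[0.,  5.,  5., 20., 30.],   # (batch_idx, x1, y1, x2, y2)
                          [0., 10., 12., 45., 48.]])  # two boxes of different sizes

pooled = roi_pool(feature_map, proposals, output_size=(7, 7), spatial_scale=1.0)
print(pooled.shape)  # torch.Size([2, 256, 7, 7]) -- fixed size, ready for FC layers
```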

So why don't YOLO and SSD use RoI pooling? Simply because they don't use region proposals! They are designed quite differently from models like R-CNN, Fast R-CNN, and Faster R-CNN. In fact, YOLO and SSD are categorized as one-stage detectors, while the R-CNN series (R-CNN, Fast R-CNN, Faster R-CNN) are called two-stage detectors, precisely because the latter first propose regions and then perform classification and regression on them.

One-stage detectors perform predictions (classification and regression) directly from the feature maps. Their approach is to divide the image into a grid, where each grid cell predicts a fixed number of bounding boxes with confidence scores and class scores. The original YOLO used a single-scale feature map, while SSD uses multi-scale feature maps.
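As a rough sketch of how such a grid prediction yields a fixed-shape output (the exact head differs between papers: the original YOLO used fully-connected layers at the end, while SSD and later YOLO versions use convolutional heads; the 1024-channel backbone output here is assumed purely for illustration):

```python
# Rough sketch of a YOLO-style grid head: every grid cell predicts
# B boxes (x, y, w, h, confidence) plus C class scores.
import torch
import torch.nn as nn

S, B, C = 7, 2, 20                        # grid size, boxes per cell, classes (original YOLO config)
feature_map = torch.randn(1, 1024, S, S)  # backbone output (channel count assumed)

head = nn.Conv2d(1024, B * 5 + C, kernel_size=1)  # illustrative 1x1-conv prediction head
out = head(feature_map)
print(out.shape)  # torch.Size([1, 30, 7, 7]) -- one fixed-shape prediction tensor per image
```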

We can see that with YOLO and SSD, the final output is a fixed-shape tensor, no matter how many objects the image contains. Prediction therefore behaves much like an ordinary regression problem, which is why they are called one-stage detectors.
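To make the fixed-output-shape point concrete, here is a back-of-the-envelope count for the SSD300 configuration described in the SSD paper: every image yields exactly the same number of box predictions, regardless of its content.

```python
# SSD300: the total number of default boxes is fixed by the architecture.
feature_map_sizes = [38, 19, 10, 5, 3, 1]  # spatial size of each of the six prediction layers
boxes_per_location = [4, 6, 6, 6, 4, 4]    # default boxes predicted at each grid cell

total = sum(s * s * b for s, b in zip(feature_map_sizes, boxes_per_location))
print(total)  # 8732
```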

Danny Fang