I am currently testing the YOLO9000 model for object detection. From the paper I understand that the image is split into a 13x13 grid, and that for each cell we calculate P(Object). But how is that calculated? How can the model know whether there is an object in a given cell or not? Please help me understand this.

I am using TensorFlow.

Thanks,

Abel Callejo
Kamel BOUYACOUB
  • I am searching for this too. There isn't much of an explanation. Also, how do they calculate P(class|object)? There is a good explanation in this Quora answer: https://www.quora.com/How-do-Multi-Object-detection-with-YOLO-Real-time-CNN-works. – Shamane Siriwardhana Apr 28 '17 at 04:02

4 Answers


They train for the confidence score = P(object) * IOU. For the cell containing the ground-truth box they take P(object) = 1, and for the rest of the grid cells the ground-truth P(object) is zero. You are training your network to tell you whether there is an object at that grid location: it should output 0 if there is no object, the IOU if there is a partial object, and 1 if an object is fully present. So at test time, your model has become capable of telling whether there is an object at that location.
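To make the training target concrete, here is a minimal sketch in plain Python of the IOU term and the confidence target described above. The box format `(x1, y1, x2, y2)` and the function names are my own illustrative choices, not from the YOLO source.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    # Coordinates of the intersection rectangle
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def confidence_target(pred_box, gt_box, object_in_cell):
    """Training target for one cell: P(object) * IOU.
    P(object) is 1 only for the cell responsible for the ground truth."""
    return iou(pred_box, gt_box) if object_in_cell else 0.0
```

During training the network's confidence output is regressed toward this target, which is how it learns to emit low values for empty cells and values near 1 for well-localized objects.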

Tejus Gupta

As mentioned in the paper (page 2, section 2), the confidence score is P(object) * IOU. In that paragraph they state that if there is an object, the confidence score equals the IOU; otherwise it is zero. So it serves as a guideline.

Shamane Siriwardhana
  • I understand that, but I can't figure out how, with only a small region of the image, and therefore only a small region of extracted features, the cell can classify that region. – Kamel BOUYACOUB May 30 '17 at 08:13
  • I take it as a sliding-window operation. Here you divide your original image into 49 squares. The ground truth is obtained from the original image, but the features used for prediction come from the last conv layer, which is 7x7. – Shamane Siriwardhana May 30 '17 at 08:36
  • But this is for training. What about the test step? If we give it an unknown image and pass it through the network, the lower layers extract the features and the image is divided into 7x7, but here we don't have the ground truth. So with just a small set of features, how can it generalize to an object that is bigger than the cell? Can you give me more details, please? – Kamel BOUYACOUB May 30 '17 at 09:00

There are 13x13 grid cells, true, but P(object) is calculated for each of 5x13x13 anchor boxes. From the YOLO9000 paper:

When we move to anchor boxes we also decouple the class prediction mechanism from the spatial location and instead predict class and objectness for every anchor box.

I can't comment yet because I'm new here, but if you're wondering about test time, it works somewhat like an RPN (region proposal network). At each grid cell, the 5 anchor boxes each predict a bounding box, which can be larger than the grid cell, and then non-maximum suppression is used to pick the top few boxes to run classification on.
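To illustrate the non-maximum suppression step mentioned above, here is a minimal greedy NMS sketch in plain Python. The box format `(x1, y1, x2, y2)` and the 0.5 overlap threshold are illustrative assumptions, not taken from the YOLO source.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy NMS: repeatedly keep the highest-scoring box and discard
    any remaining box that overlaps it by more than iou_thresh.
    Returns the indices of the kept boxes."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        i = order.pop(0)
        keep.append(i)
        order = [j for j in order if iou(boxes[i], boxes[j]) < iou_thresh]
    return keep
```

For example, two heavily overlapping predictions of the same object collapse to the single higher-scoring one, while a distant box survives.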

P(object) is just a probability, the network doesn't "know" if there is really an object in there or not.

You can also look at the source code for the forward_region_layer method in region_layer.c and trace how the losses are calculated, if you're interested.

SamShady

At test time there is no ground truth, so no IOU can be computed; the YOLO network instead compares its predicted confidence against a default threshold, which is set to 0.5.
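The thresholding step can be sketched in a couple of lines of plain Python. The function name and the 0.5 default are illustrative, mirroring the default value mentioned above rather than any specific YOLO implementation.

```python
def filter_detections(confidences, boxes, conf_thresh=0.5):
    """Keep only the boxes whose predicted confidence exceeds the threshold."""
    return [(c, b) for c, b in zip(confidences, boxes) if c > conf_thresh]
```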