I want to implement YOLO v1, but I have some questions regarding the algorithm.
I understand that in YOLO we divide the image into a grid of cells (7x7), and for each cell we predict a fixed number of bounding boxes (2 by default in the paper, each with 4 coordinates: x, y, w, h) together with a confidence score, plus one set of class scores per cell. At test time we can use the NMS algorithm to remove multiple detections of the same object.
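To check my understanding of the test-time step, here is a minimal greedy NMS sketch I wrote (boxes in corner format; the function names and thresholds are my own, not from the paper):

```python
import numpy as np

def iou(box_a, box_b):
    """IoU of two boxes given as (x1, y1, x2, y2) corners."""
    x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2]); y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: repeatedly keep the highest-scoring box and
    drop remaining boxes that overlap it too much."""
    order = np.argsort(scores)[::-1]  # indices by descending score
    keep = []
    while len(order) > 0:
        best = order[0]
        keep.append(best)
        # keep only boxes whose overlap with `best` is below the threshold
        order = [i for i in order[1:] if iou(boxes[best], boxes[i]) < iou_threshold]
    return keep

# Two near-duplicate detections of one object plus a distinct one.
boxes = np.array([[10, 10, 50, 50], [12, 12, 52, 52], [100, 100, 140, 140]], dtype=float)
scores = np.array([0.9, 0.8, 0.7])
print(nms(boxes, scores))  # the weaker duplicate (index 1) is suppressed
```

Is this the right picture of what happens after the forward pass?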
1) When do we divide the image into a grid? When I read the paper, they mention dividing the image, but when I look at the architecture of the network, it seems to have two parts: the convolutional layers and the FC layers. Does that mean the network does it "naturally" through the bounding-box output? Is the 7x7 grid size specific to the convolutional part used in the paper? If we used VGG as a backbone, for example, would the grid size change?
EDIT: It seems that the grid division is "virtual": it exists only in how we interpret the output of our network.
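To illustrate what I mean by "virtual": nothing in the network slices the image; the grid is just how the output tensor is interpreted. A small sketch of my understanding (S=7, B=2, C=20 and the 448x448 input come from the paper; the rest is my own illustration):

```python
import numpy as np

# YOLO v1 hyper-parameters from the paper: S x S grid, B boxes per cell, C classes.
S, B, C = 7, 2, 20
IMG_SIZE = 448  # input resolution used in the paper

# The final FC layer is simply reshaped into this tensor; the conv
# layers never partition the image explicitly.
output = np.random.rand(S, S, B * 5 + C)  # shape (7, 7, 30)
print(output.shape)

# Cell (row, col) is "responsible" for objects whose center falls in
# this pixel region of the input image:
row, col = 3, 5
cell = IMG_SIZE / S  # 64 pixels per cell
region = (col * cell, row * cell, (col + 1) * cell, (row + 1) * cell)
print(region)  # -> (320.0, 192.0, 384.0, 256.0)
```

So with a VGG backbone, I would expect S to be whatever the final reshape is designed to produce, not a property of the convolutions themselves. Is that right?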
2) Two bounding boxes are predicted for each cell, but a cell can only be responsible for one object. Why do we use two bounding boxes?
At training time we only want one bounding box predictor to be responsible for each object. We assign one predictor to be “responsible” for predicting an object based on which prediction has the highest current IOU with the ground truth. This leads to specialization between the bounding box predictors. Each predictor gets better at predicting certain sizes, aspect ratios, or classes of object, improving overall recall.
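If I understand the quote correctly, the "responsible" predictor is just the argmax over IOU with the ground truth, something like this toy sketch (the numbers and variable names are mine, not from the paper):

```python
def iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union

# One cell's B=2 current predictions and the ground-truth box (toy numbers).
predictions = [(8, 8, 48, 52), (0, 0, 20, 20)]
ground_truth = (10, 10, 50, 50)

# The predictor with the highest IOU is "responsible": only it receives
# the coordinate and objectness loss for this object.
responsible = max(range(len(predictions)),
                  key=lambda i: iou(predictions[i], ground_truth))
print(responsible)  # -> 0
```

Is this per-object assignment what drives the "specialization" the paper describes?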
3) I do not really get this quote. It says there is one bounding box predictor responsible for each object in the image. But the bounding box seems limited to its cell, so how does YOLO work when the object is bigger than one cell?
4) Regarding the output layer, it is said they use a linear activation function, but is its output capped at 1? I ask because they say they normalize the coordinates between 0 and 1 (and I think the same holds for the confidence and class predictions).
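For context, here is how I currently understand the target encoding that produces values in [0, 1]: x, y are the center offset relative to the cell, and w, h are relative to the whole image (a sketch; the variable names are mine, not from the paper):

```python
# Encoding one ground-truth box into YOLO v1's normalized targets.
S, IMG = 7, 448
cell = IMG / S  # 64 pixels per cell

# Ground-truth box in pixels: center (cx, cy), width, height.
cx, cy, w, h = 250.0, 130.0, 100.0, 80.0

col, row = int(cx // cell), int(cy // cell)  # which cell owns the center
x = cx / cell - col  # center offset inside the cell, in [0, 1]
y = cy / cell - row
w_n = w / IMG        # width/height relative to the whole image, in [0, 1]
h_n = h / IMG
print(col, row, x, y, w_n, h_n)
```

So my question is really whether the network is expected to learn these [0, 1] targets with a purely linear output, or whether something clips the values.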