I am reading the faster-rcnn and ssd code for object detection. The prediction layer use the 3x3 filter to predict box position and class label.
Why not use 2x2 filter or 4x4 filter or 5x5 filter to predict them?
I am reading the faster-rcnn and ssd code for object detection. The prediction layer use the 3x3 filter to predict box position and class label.
Why not use 2x2 filter or 4x4 filter or 5x5 filter to predict them?
This is simply a choice of a hyper-parameter. Such choices can be made by cross validation of hyper-parameter search, meaning training a few models with different choices of a hyper-parameter, and seeing who got the best performance on the validation set. Specifically for 3x3 convolution, this has been made popular since the VGG paper which suggested that stacking many 3x3 convolutions (which is considered a small kernel) can give good performance.