
I am a beginner in convolutional deep learning. I saw the following architecture in the paper "Simultaneous Feature Learning and Hash Coding with Deep Neural Networks", for images of size 256*256:

|type|filter size/stride|output size|
|:-|:-|:-|
|convolution|11*11 / 4|96 * 54 * 54|
|convolution|1*1 / 1|96 * 54 * 54|
|max pool|3*3 / 2|96 * 27 * 27|

I do not understand the output size of the first 2D convolution: 96*54*54. The 96 makes sense, since the layer has 96 filters. But applying the usual output-size formula gives: size = floor((W - K + 2P)/S) + 1 = floor((256 - 11 + 2*0)/4) + 1 = floor(61.25) + 1 = 62. I assumed the padding P to be 0, as the paper does not mention it anywhere. The Keras Conv2D API produces the same 96*62*62 output. Why, then, does the paper report 96*54*54? What am I missing?

Shuvam Shah

1 Answer


This reminds me of the AlexNet paper, which had a similar mistake. Your calculation is correct. I think the authors mistakenly wrote 256x256 instead of 224x224, in which case the calculation for the first convolution is:

(224-11+2*0)/4 + 1 = 54.25 ~ 54
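The same arithmetic can be checked with a few lines of Python. This is only a sketch: `conv_output_size` is a helper name made up for illustration, and it assumes zero padding and floor rounding, as in the question.

```python
def conv_output_size(w, k, s, p=0):
    """Spatial output size of a convolution: floor((w - k + 2p) / s) + 1.

    w: input width/height, k: kernel size, s: stride, p: padding
    (p=0 is an assumption; the paper does not state the padding).
    """
    return (w - k + 2 * p) // s + 1


# Compare the input size stated in the paper with the suspected real one.
for w in (256, 224):
    print(w, "->", conv_output_size(w, k=11, s=4))
```

With these assumptions, an input of 256 gives 62 and an input of 224 gives 54, which matches the 96 * 54 * 54 entry in the table.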

Most likely the authors wrote 256x256 by mistake while the network's real input size was 224x224 (that was the case in AlexNet also). The other, much less likely option is that 256x256 really was the input size and they did the output-size calculation for 224x224. I consider that too silly a mistake to be a realistic possibility.

Thus, I believe the true input size was 224x224 instead of 256x256.
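As a rough sanity check on that conclusion, one can list every input size that would produce a 54-wide output from an 11*11 convolution with stride 4. This is a sketch under the same assumptions as the question (zero padding, floor rounding); the helper name and the search range 200 to 259 are arbitrary choices for illustration.

```python
def conv_output_size(w, k, s, p=0):
    # floor((w - k + 2p) / s) + 1, with padding p assumed to be 0
    return (w - k + 2 * p) // s + 1


# Search a range of plausible input sizes for those that yield 54.
candidates = [w for w in range(200, 260) if conv_output_size(w, k=11, s=4) == 54]
print(candidates)
```

Under these assumptions only 223 to 226 qualify, so 256 cannot produce 54 without changing the padding or stride, and 224 is the one conventional image size in that range.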

null