
[Figure: MobileNet architecture table listing each layer's type, filter shape, stride, and input size]

The above figure shows the MobileNet architecture. In the very first row the input size is given as 224x224x3, with a filter shape of 3x3x3x32 and a stride of 2. If I apply the formula out_size = ((input_size - filter_size + 2*padding)/stride) + 1 with padding = 0, I get out_size = (224 - 3 + 2*0)/2 + 1 = 111.5, but in the second row the input size is given as 112x112x32. I'm new to these concepts; can anyone explain where I am going wrong?
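The mismatch can be reproduced with a quick sketch of the output-size formula from the question (the 224/3/2 values come from the figure; the helper name is just for illustration):

```python
def conv_out_size(input_size, filter_size, stride, padding=0):
    """Raw convolution output-size formula, before any rounding."""
    return (input_size - filter_size + 2 * padding) / stride + 1

# First MobileNet layer: 224x224 input, 3x3 kernel, stride 2
print(conv_out_size(224, 3, 2, padding=0))  # 111.5 -- not an integer
print(conv_out_size(224, 3, 2, padding=1))  # 112.5 -- floor gives 112
```

With padding = 0 the result is fractional, which is exactly the inconsistency in the question; frameworks resolve it by padding and flooring.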

  • I'm voting to close this question because it is not about programming as defined in the [help] but about ML theory and/or methodology - please see the intro and NOTE in the `deep-learning` [tag info](https://stackoverflow.com/tags/deep-learning/info). – desertnaut Jun 29 '21 at 11:31

1 Answer


You are not wrong: without padding, the output shape of the first 2D convolution layer would not match the table.

To make it work, you pad one side of the left-right dimension and one side of the top-bottom dimension. That gives an effective input of 225x225x3, which yields the correct output shape after a 2D convolution with stride 2 and a 3x3 kernel.
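A quick arithmetic check (plain Python, variable names are illustrative) shows that 225 is exactly the size at which a 3x3 kernel with stride 2 tiles the input with nothing left over:

```python
input_size, kernel, stride = 225, 3, 2

# The kernel's last placement starts at input_size - kernel, so the fit
# is exact when (input_size - kernel) is divisible by the stride.
assert (input_size - kernel) % stride == 0

out_size = (input_size - kernel) // stride + 1
print(out_size)  # 112
```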

With PyTorch, you can simply set padding=1 in

import torch

torch.nn.Conv2d(in_channels=3, out_channels=32, kernel_size=(3, 3), stride=2, padding=1)

With padding=1, zeros are added on both sides of each spatial dimension (giving 226x226), but the floor in the output-size formula means the extra padded row and column are never reached by the kernel, so the layer returns an output of spatial size 112x112 with 32 channels, matching the 112x112x32 in the table.
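You can verify this with the output-size formula documented for `torch.nn.Conv2d`, H_out = floor((H_in + 2*padding - kernel)/stride) + 1, in plain Python (the helper name is illustrative):

```python
import math

def conv2d_out(h_in, kernel, stride, padding):
    # Output-size formula from the torch.nn.Conv2d docs (dilation = 1)
    return math.floor((h_in + 2 * padding - kernel) / stride) + 1

print(conv2d_out(224, 3, 2, 1))  # 112 -- matches the MobileNet table
print(conv2d_out(224, 3, 2, 0))  # 111 -- what you'd get with no padding
```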

  • "a padding to one side of the top-bottom dimension. That way you'll have an input shape of 225x225x3" - it's confusing at this point. If I add one padding to the left-right and one to the top-bottom, the number of pixels becomes 226x226x3, right? – Vignesh Kathirkamar Jul 03 '21 at 08:16
  • If you go through the [PyTorch source code](https://github.com/pytorch/pytorch/blob/c780610f2d8358297cb4e4460692d496e124d64d/aten/src/ATen/native/Convolution.cpp#L411), you can find that padding=1 concatenates zeros on both axes, giving a shape of 226x226x3. BUT the convolutions, going from left to right and from top to bottom, stop on each axis as soon as the last non-padded value is reached. That is, in the MobileNet example the last row and last column of zeros are not used, so the convolutions effectively operate on 225x225x3 images. See Figure 2.7 [here](https://arxiv.org/pdf/1603.07285v1.pdf) – Richard Faucheron Jul 05 '21 at 07:59