Stuck understanding ResNet's Identity block and Convolutional blocks

Question

I'm learning about Residual Networks (ResNet50) from Andrew Ng's Coursera lectures. I understand that one of the main reasons ResNets work is that they can easily learn the identity function, which is why adding more layers does not hurt the network's performance.

As described in the lectures, two types of blocks are used in ResNets: the identity block and the convolutional block.

The identity block is used when the input and output dimensions are the same. The convolutional block is almost the same as the identity block, but it has a convolutional layer in the shortcut path that changes the dimensions so that the shortcut's output matches the output of the main path.

Here is the identity block:

and here is the convolutional block:

Now, in the implementation of the convolutional block (2nd image), the first component of the main path (i.e. Conv2D --> BatchNorm --> ReLU) is implemented with a 1x1 convolution and stride > 1.

# First component of main path 
X = Conv2D(F1, (1, 1), strides = (s,s), name = conv_name_base + '2a', padding = 'valid', kernel_initializer = glorot_uniform(seed=0))(X)
X = BatchNormalization(axis = 3, name = bn_name_base + '2a')(X)
X = Activation('relu')(X)

I don't understand the reason behind keeping stride > 1 with window size 1. Isn't it just data loss? We are just considering alternate pixels in this case.

What could be the reason for this choice of hyperparameters? Any intuitive explanation will help! Thanks.

Manoj Mohan · Answer 1 · 2019-10-03T07:09:40.733

"I don't understand the reason behind keeping stride > 1 with window size 1. Isn't it just data loss?"

Please refer to the section on Deeper Bottleneck Architectures in the ResNet paper, and to Figure 5: https://arxiv.org/pdf/1512.03385.pdf

1 x 1 convolutions are typically used to increase or decrease the dimensionality along the filter (channel) dimension. So, in the bottleneck architecture, the first 1 x 1 layer reduces the number of channels so that the 3 x 3 layer handles smaller input/output dimensions. The final 1 x 1 layer then increases the number of channels again.

This is done to save computation and training time.
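As a rough illustration of the saving, here is a minimal sketch that counts convolution weights for the 256-channel bottleneck (1 x 1, 64 -> 3 x 3, 64 -> 1 x 1, 256) and for a hypothetical alternative with two 3 x 3 layers kept at the full 256 channels. The helper name `conv_weights` is made up for this example, and biases and batch-norm parameters are ignored.

```python
def conv_weights(kernel_size, in_channels, out_channels):
    """Number of weights in a kernel_size x kernel_size convolution (no bias)."""
    return kernel_size * kernel_size * in_channels * out_channels

# Bottleneck: 1x1 reduces 256 -> 64, 3x3 works on 64 channels,
# and the last 1x1 restores 64 -> 256.
bottleneck = (
    conv_weights(1, 256, 64)
    + conv_weights(3, 64, 64)
    + conv_weights(1, 64, 256)
)

# Hypothetical comparison (an assumption for illustration):
# two 3x3 layers operating on all 256 channels.
plain = 2 * conv_weights(3, 256, 256)

print(bottleneck)                    # 69632
print(plain)                         # 1179648
print(round(plain / bottleneck, 1))  # 16.9
```

The multiply-add count per output position scales the same way, so under these assumptions the 3 x 3 layer is far cheaper when it sits between the two 1 x 1 layers.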

From the paper,

"Because of concerns on the training time that we can afford, we modify the building block as a bottleneck design".

joaomcouto · Answer 2 · 2019-10-03T00:08:56.823

I believe you might have answered your own question. The convolutional block is used whenever you need to change the dimensions of the shortcut so that they match the output of the main path. That being said, how do you change the spatial dimensions of a volume using convolutions? Well, you change the stride.

For any given convolution operation, assuming a square input, the dimension of the output volume is given by the formula floor((n + 2p - f)/s) + 1, where n is the input dimension, p is the zero-padding, f is the filter dimension, and s is the stride. By increasing the stride you're effectively reducing the dimension of your shortcut's output volume, so it can be used to make sure that the dimensions of the shortcut and main paths match when the final sum is performed.
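The formula is easy to check numerically. The following is a minimal sketch; the function name `conv_output_size` is made up for this example, and integer (floor) division is used so that inputs whose size is not a multiple of the stride are handled too.

```python
def conv_output_size(n, f, p=0, s=1):
    """Output size along one spatial axis: floor((n + 2p - f) / s) + 1."""
    return (n + 2 * p - f) // s + 1

# 1x1 filter, no padding: stride 1 keeps the size, stride 2 halves it.
print(conv_output_size(56, f=1, p=0, s=1))  # 56
print(conv_output_size(56, f=1, p=0, s=2))  # 28

# 3x3 filter with p = 1 and stride 1 also keeps the size.
print(conv_output_size(56, f=3, p=1, s=1))  # 56
```

So a 1 x 1 convolution with s = 2 and no padding turns a 56 x 56 input into a 28 x 28 output.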

Why is it >1 then? Well, if you didn't need a stride larger than one (or a different number of channels), you wouldn't need a dimension alteration in the first place and would therefore use the identity block instead.
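To see the two paths line up, here is a minimal shape-tracing sketch. It assumes a block layout that the question only partly shows: a main path of a 1 x 1 convolution with stride s, an f x f convolution with 'same' padding, and a final 1 x 1 convolution, plus a shortcut with a single 1 x 1 convolution with stride s and F3 filters. The helper names `conv_out` and `conv_block_shapes` are made up for this example.

```python
def conv_out(n, f, s=1, padding="valid"):
    """Spatial output size of a conv layer under Keras-style padding rules."""
    if padding == "same":
        return -(-n // s)        # ceil(n / s)
    return (n - f) // s + 1      # 'valid': no zero-padding

def conv_block_shapes(n, filters, f=3, s=2):
    """Return (main_path_shape, shortcut_shape) for a square n x n input."""
    F1, F2, F3 = filters
    # Main path (assumed layout): 1x1 stride s -> f x f 'same' -> 1x1.
    m = conv_out(n, 1, s)
    m = conv_out(m, f, 1, padding="same")
    m = conv_out(m, 1, 1)
    # Shortcut path (assumed layout): one 1x1 conv with stride s, F3 filters.
    sc = conv_out(n, 1, s)
    return (m, m, F3), (sc, sc, F3)

main, shortcut = conv_block_shapes(56, filters=(128, 128, 512), s=2)
print(main, shortcut)    # (28, 28, 512) (28, 28, 512)

# With s = 1 the spatial size is unchanged; the shortcut convolution
# then only changes the number of channels.
main1, shortcut1 = conv_block_shapes(56, filters=(64, 64, 256), s=1)
print(main1, shortcut1)  # (56, 56, 256) (56, 56, 256)
```

Because both paths apply the same stride once, their outputs have the same shape and can be added element-wise.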
