I'm learning Residual Networks (ResNet50) from Andrew Ng coursera lectures. I understand that one of the main reasons why ResNets work is that they can learn identity function and that's why adding more and more layers in network does not hurt the performance of the network.
Now as described in lectures, there are two type of blocks are used in ResNets: 1) Identity block and Convolutional block.
Identity Block is used when there is no change in input and output dimensions. Convolutional block is almost same as identity block but there is a convolutional layer in short-cut
path to just change the dimension such that the dimension of input and output matches.
Here is identity block:
and here is convolutional block:
Now in implementation of convolutional block (2nd image), First block (i.e. conv2d --> BatchNorm --> ReLu
is implemented with 1x1
convolution and stride > 1.
# First component of main path
X = Conv2D(F1, (1, 1), strides = (s,s), name = conv_name_base + '2a', padding = 'valid', kernel_initializer = glorot_uniform(seed=0))(X)
X = BatchNormalization(axis = 3, name = bn_name_base + '2a')(X)
X = Activation('relu')(X)
I don't understand the reason behind keeping stride > 1 with window size 1. Isn't it just data loss? We are just considering alternate pixels in this case.
What should be the possible reason for such hyperparameter selection? Any intuitive explanation will help! Thanks.