
In SegNet, the architecture proposed by the authors is shown below. [figure: SegNet encoder-decoder architecture]

What confuses me is that each building block contains two consecutive convolutional layers, labeled 1 and 2 in the figure. What is the main motivation for stacking convolution layers this way instead of merging them into a single convolutional layer?

user288609

2 Answers


SegNet's encoder uses the 13 convolutional layers from VGG-16 (2+2+3+3+3).

Check this visualization and the paper for more information.

From the paper:

It is easy to see that a stack of two 3×3 conv. layers (without spatial pooling in between) has an effective receptive field of 5×5; three such layers have a 7×7 effective receptive field. So what have we gained by using, for instance, a stack of three 3×3 conv. layers instead of a single 7×7 layer? First, we incorporate three non-linear rectification layers instead of a single one, which makes the decision function more discriminative. Second, we decrease the number of parameters: assuming that both the input and the output of a three-layer 3×3 convolution stack has C channels, the stack is parametrised by 3(3²C²) = 27C² weights; at the same time, a single 7×7 conv. layer would require 7²C² = 49C² parameters, i.e. 81% more. This can be seen as imposing a regularisation on the 7×7 conv. filters, forcing them to have a decomposition through the 3×3 filters (with non-linearity injected in between).
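A quick way to sanity-check the parameter counts in that quote is to build both options and count their weights. This is a minimal sketch of my own, assuming PyTorch and an illustrative channel count C = 512; it is not code from SegNet or the paper:

```python
# Compare parameter counts: three stacked 3x3 convs vs. a single 7x7 conv,
# both mapping C -> C channels. Biases are omitted to match 27*C^2 vs 49*C^2.
import torch.nn as nn

C = 512  # illustrative channel count (assumption, not from the answer)

stack_3x3 = nn.Sequential(
    nn.Conv2d(C, C, kernel_size=3, padding=1, bias=False),
    nn.ReLU(inplace=True),
    nn.Conv2d(C, C, kernel_size=3, padding=1, bias=False),
    nn.ReLU(inplace=True),
    nn.Conv2d(C, C, kernel_size=3, padding=1, bias=False),
    nn.ReLU(inplace=True),
)
single_7x7 = nn.Conv2d(C, C, kernel_size=7, padding=3, bias=False)

n_stack = sum(p.numel() for p in stack_3x3.parameters())    # 3 * 9 * C^2 = 27*C^2
n_single = sum(p.numel() for p in single_7x7.parameters())  # 49 * C^2

print(n_stack, n_single, n_single / n_stack)  # ratio ~= 1.81, i.e. ~81% more
```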


If you look at the legend at the bottom of the figure you attached, you'll see that in this SegNet illustration each blue layer stands for "Conv + BatchNormalization + ReLU": that is, there is a non-linear activation (ReLU) between the two convolutions, so they are not simply two stacked linear operations.
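For concreteness, one such encoder block might be written like this; a minimal PyTorch sketch of my own, where the helper name and channel counts are illustrative and not taken from the SegNet code:

```python
# One SegNet-style encoder block: each "blue layer" in the figure's legend is
# Conv + BatchNorm + ReLU, so a non-linearity sits between the two convolutions.
import torch.nn as nn

def conv_bn_relu(in_ch, out_ch):
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

encoder_block = nn.Sequential(
    conv_bn_relu(64, 128),   # conv layer "1" in the question
    conv_bn_relu(128, 128),  # conv layer "2" in the question
    # Pooling indices are returned so a SegNet decoder could reuse them for upsampling.
    nn.MaxPool2d(kernel_size=2, stride=2, return_indices=True),
)
```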

Regardless of this example, one might want to stack two linear units without any non-linearity in between in order to explicitly control/regularize the rank of the linear operation. See, for example, how to reduce the dimensionality of a fully connected layer using the SVD trick, as sketched below.
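A minimal sketch of that SVD trick, assuming NumPy; the matrix dimensions and the rank k are hypothetical hyper-parameters, not values from any particular network:

```python
# Replace a fully connected layer W (out x in) with two stacked linear layers
# of rank k via truncated SVD -- no non-linearity between the two factors.
import numpy as np

out_dim, in_dim, k = 1024, 1024, 64          # illustrative sizes (assumptions)
W = np.random.randn(out_dim, in_dim)          # stands in for trained FC weights

U, s, Vt = np.linalg.svd(W, full_matrices=False)
W1 = np.diag(s[:k]) @ Vt[:k, :]               # first linear layer:  k x in_dim
W2 = U[:, :k]                                 # second linear layer: out_dim x k

# W2 @ W1 is the best rank-k approximation of W, with k*(in_dim + out_dim)
# parameters instead of in_dim*out_dim.
rel_error = np.linalg.norm(W - W2 @ W1) / np.linalg.norm(W)
print(rel_error)
```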

Shai
  • thanks for your answer. I am still confused about this question. Say the conv layer (marked as 1) generates 512 feature maps, and the conv layer (marked as 2) generates 512 feature maps. Why not use a single layer to generate 1024 feature maps? – user288609 Feb 01 '17 at 13:48
  • @user288609 it is not equivalent: (a) you have a non-linearity between the layers. (b) if the conv kernel is 3x3, then applying 3x3 twice is like applying 5x5 once (in terms of receptive field). Breaking up the linear layers in this way lets you model more complex structures than a single linear one. – Shai Feb 01 '17 at 14:30
  • @user288609 the question you ask is very deep; you cannot expect to get a full answer for it in a comment. I'm not sure even a full answer is the right scope. – Shai Feb 01 '17 at 14:31