
I am trying to understand how the MNIST example in MatConvNet is designed. It looks like it uses a LeNet variation, but since I have not used MatConvNet before, I am having difficulty understanding how the connection between the last convolutional layer and the first fully connected layer is established:

f = 1/100 ;  % weight initialization scale used in the MatConvNet MNIST example
net.layers = {} ;
% conv1: 5x5x1 filters, 20 outputs
net.layers{end+1} = struct('type', 'conv', ...
                       'weights', {{f*randn(5,5,1,20, 'single'), zeros(1, 20, 'single')}}, ...
                       'stride', 1, ...
                       'pad', 0) ;
% pool1: 2x2 max pooling, stride 2
net.layers{end+1} = struct('type', 'pool', ...
                       'method', 'max', ...
                       'pool', [2 2], ...
                       'stride', 2, ...
                       'pad', 0) ;
% conv2: 5x5x20 filters, 50 outputs
net.layers{end+1} = struct('type', 'conv', ...
                       'weights', {{f*randn(5,5,20,50, 'single'), zeros(1,50,'single')}}, ...
                       'stride', 1, ...
                       'pad', 0) ;
% pool2: 2x2 max pooling, stride 2
net.layers{end+1} = struct('type', 'pool', ...
                       'method', 'max', ...
                       'pool', [2 2], ...
                       'stride', 2, ...
                       'pad', 0) ;
% conv3: 4x4x50 filters, 500 outputs
net.layers{end+1} = struct('type', 'conv', ...
                       'weights', {{f*randn(4,4,50,500, 'single'), zeros(1,500,'single')}}, ...
                       'stride', 1, ...
                       'pad', 0) ;
net.layers{end+1} = struct('type', 'relu') ;
% conv4: 1x1x500 filters, 10 outputs
net.layers{end+1} = struct('type', 'conv', ...
                       'weights', {{f*randn(1,1,500,10, 'single'), zeros(1,10,'single')}}, ...
                       'stride', 1, ...
                       'pad', 0) ;
net.layers{end+1} = struct('type', 'softmaxloss') ;

Usually, in libraries like TensorFlow and MXNet, the last convolutional layer is flattened and then connected to the fully connected one. Here, as far as I understand, the convolutional layer with the weights {{f*randn(4,4,50,500, 'single'), zeros(1,500,'single')}} plays the role of the first fully connected layer, but such a layer still produces a three-dimensional activation map as its result. I don't see how the "flattening" happens here. I need help understanding how the connection between the convolutional layers and the fully connected layers is established.
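To see where my confusion is, a quick (untested here) way to inspect the activation sizes would be to drop the loss layer and push a dummy image through vl_simplenn. This is just a sketch; it assumes MatConvNet is compiled and on the path, and that f is the 1/100 initialization scale from the example:

% Sketch: print the output size of every layer for a dummy input
netEval = net ;
netEval.layers(end) = [] ;              % drop the softmaxloss layer (it expects labels)
im = randn(28, 28, 1, 1, 'single') ;    % one dummy MNIST-sized image
res = vl_simplenn(netEval, im) ;        % forward pass; res(1).x is the input
for i = 2:numel(res)
  fprintf('output of layer %d: %s\n', i-1, mat2str(size(res(i).x))) ;
end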

Ufuk Can Bicici

1 Answer


As far as I know, you simply substitute the fully connected layer with a convolutional layer whose filters have the same width and height as its input. In fact, you don't need to flatten the data before a fully connected layer in MatConvNet (flattened data has a 1 x 1 x D x N shape). In your case, a kernel with the same spatial size as the input, i.e. 4x4, operates as a fully connected layer, and its output is 1 x 1 x 500 x B (B stands for the fourth dimension of the input, the batch size).
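Here is a minimal sketch (assuming MatConvNet's vl_nnconv is on the path) showing that a 4x4 convolution over a 4x4x50 input computes the same thing as a fully connected layer applied to the flattened 800-dimensional vector:

x = randn(4, 4, 50, 1, 'single') ;      % one 4x4x50 activation volume (B = 1)
w = randn(4, 4, 50, 500, 'single') ;    % 500 filters, each covering the whole input
b = zeros(1, 500, 'single') ;
y = vl_nnconv(x, w, b) ;                % size(y) is 1 x 1 x 500 x 1

W   = reshape(w, [], 500) ;             % 800 x 500 weight matrix of the equivalent FC layer
yfc = x(:)' * W + b ;                   % 1 x 500, same numbers as squeeze(y)'
max(abs(squeeze(y)' - yfc))             % ~0, up to floating point error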

Update: the architecture of the network and the sizes of its outputs are visualized here to show the flow of operations.

Hossein Kashiani
  • Then there is some confusion about what `{{f*randn(4,4,50,500, 'single'), zeros(1,500,'single')}}` means. The above code processes MNIST images, which are 28x28 pixels. According to my understanding, the first layer takes a 28x28 image and filters it with 20 5x5 kernels. The output is 20 response maps of size 28x28. Then max pooling with a 2x2 kernel and a stride of 2 is applied, so we now have 20 response maps of size 14x14. Then the second conv layer applies 50 5x5 kernels to its 20 inputs, giving 50 response maps of size 14x14. After another max pool we have 50 response maps of size 7x7. – Ufuk Can Bicici Mar 26 '18 at 08:16
  • (continued) `{{f*randn(4,4,50,500, 'single'), zeros(1,500,'single')}}` is applied to this last layer. Each of the 50 7x7 inputs is convolved with 500 4x4 kernels. Since no pooling is applied, the response dimensions do not change and we have 500 7x7 inputs for the next layer. Then for each input it applies 10 1x1 filters and we are left with 10 7x7 outputs. I can't imagine how softmax is applied to 7x7 inputs. But I don't know how these tuples in the layer constructions are interpreted in MatConvNet, since it's horrendously documented. Please correct me if the correct flow is not like that. – Ufuk Can Bicici Mar 26 '18 at 08:26
  • You actually made a mistake at the first convolutional layer. Let's visualize the network for you. – Hossein Kashiani Mar 26 '18 at 09:12
  • Thanks for the clarification with the network image. Why does the response maps' size get reduced after each convolutional layer? I am more used to MXNet- and TensorFlow-style conv layers and did not expect such behavior. – Ufuk Can Bicici Mar 27 '18 at 14:44
  • As you know, the spatial output size of a conv layer is computed as `(W-F+2P)/S+1`. For instance, for the first conv layer with `padding=0` and `stride=1` the output is `(28-5+0)/1+1=24`. Each pooling layer acts the same way but with a different configuration; for instance, the first pooling layer with `padding=0` and `stride=2` gives `(24-2)/2+1=12` (see the sketch below). – Hossein Kashiani Mar 27 '18 at 16:38
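A short sketch (plain MATLAB, layer parameters taken from the code in the question) that applies this formula layer by layer for the network above:

% (W - F + 2P)/S + 1, applied layer by layer
sz = 28 ;                       % MNIST input: 28x28
sz = (sz - 5 + 2*0)/1 + 1 ;     % conv1, 5x5, pad 0, stride 1 -> 24
sz = (sz - 2 + 2*0)/2 + 1 ;     % pool1, 2x2, stride 2        -> 12
sz = (sz - 5 + 2*0)/1 + 1 ;     % conv2, 5x5, pad 0, stride 1 -> 8
sz = (sz - 2 + 2*0)/2 + 1 ;     % pool2, 2x2, stride 2        -> 4
sz = (sz - 4 + 2*0)/1 + 1 ;     % conv3, 4x4, pad 0, stride 1 -> 1 (acts as FC, 500 channels)
sz = (sz - 1 + 2*0)/1 + 1 ;     % conv4, 1x1                  -> 1 (10 class scores)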