
I'm looking at the InceptionV3 (GoogLeNet) architecture and can't understand why we need conv1x1 layers.

I know how convolution works, but I only see the benefit when the patch size is greater than 1.

– Verych

2 Answers


You can think of a 1x1xD convolution as a dimensionality reduction technique when it's placed somewhere inside a network.

If you have an input volume of 100x100x512 and you convolve it with a set of D filters, each of size 1x1x512, you reduce the number of features from 512 to D. The output volume is therefore 100x100xD.
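For example, here is a minimal sketch in PyTorch (the framework and D = 64 are my choices for illustration, not something the answer prescribes):

```python
import torch
import torch.nn as nn

# Input volume: one 100x100 feature map with 512 channels
x = torch.randn(1, 512, 100, 100)

# D = 64 filters, each of size 1x1x512: reduces the features from 512 to 64
conv1x1 = nn.Conv2d(in_channels=512, out_channels=64, kernel_size=1)

y = conv1x1(x)
print(y.shape)  # torch.Size([1, 64, 100, 100]), i.e. a 100x100x64 volume
```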

As you can see, this (1x1x512)xD convolution is mathematically equivalent to a fully connected layer. The main difference is that while an FC layer requires the input to have a fixed size, the convolutional layer can accept as input any volume with a spatial extent greater than or equal to 100x100.

Because of this equivalence, a 1x1xD convolution can substitute for any fully connected layer.
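One way to check the equivalence concretely (again a PyTorch sketch; the 512 -> 10 sizes are illustrative): copy the FC weights into a 1x1 convolution and compare the outputs on a single spatial location.

```python
import torch
import torch.nn as nn

fc = nn.Linear(512, 10)                   # FC layer: 512 features -> 10
conv = nn.Conv2d(512, 10, kernel_size=1)  # the (1x1x512)x10 convolution

# Reshape the FC weight matrix (10, 512) into conv kernels (10, 512, 1, 1)
with torch.no_grad():
    conv.weight.copy_(fc.weight.view(10, 512, 1, 1))
    conv.bias.copy_(fc.bias)

x = torch.randn(1, 512, 1, 1)             # a single spatial location
out_fc = fc(x.flatten(1))                 # shape (1, 10)
out_conv = conv(x).flatten(1)             # shape (1, 10)
print(torch.allclose(out_fc, out_conv, atol=1e-6))  # True
```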

In addition, 1x1xD convolutions not only reduce the number of features passed to the next layer, but also introduce new parameters and a new non-linearity into the network, which can help increase model accuracy.

When a 1x1xD convolution is placed at the end of a classification network, it acts exactly like an FC layer, but instead of thinking of it as a dimensionality reduction technique, it's more intuitive to think of it as a layer that outputs a tensor of shape WxHxnum_classes.

The spatial extent of the output tensor (W and H) is dynamic and is determined by how many locations of the input image the network analyzed.

If the network has been defined with an input of 200x200x3 and we feed it an image of that size, the output will be a map with W = H = 1 and depth = num_classes. But if the input image has a spatial extent greater than 200x200, then the convolutional network will analyze different locations of the input image (just like a standard convolution does) and will produce a tensor with W > 1 and H > 1. This is not possible with an FC layer, which constrains the network to accept fixed-size input and produce fixed-size output.
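A toy fully convolutional network makes this behaviour visible (a sketch: all the layer shapes and num_classes = 5 are invented for the example):

```python
import torch
import torch.nn as nn

net = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=8, stride=8),   # 200x200 -> 25x25
    nn.ReLU(),
    nn.Conv2d(16, 32, kernel_size=5, stride=5),  # 25x25 -> 5x5
    nn.ReLU(),
    nn.Conv2d(32, 5, kernel_size=5),             # 5x5 -> 1x1, the "FC" head
)

print(net(torch.randn(1, 3, 200, 200)).shape)  # (1, 5, 1, 1): W = H = 1
print(net(torch.randn(1, 3, 280, 280)).shape)  # (1, 5, 3, 3): W > 1, H > 1
```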

– nessuno
  • So, a conv operation like 1x1x1 is absolutely useless, correct? – Verych Sep 07 '16 at 13:23
  • 5
    There's no such thing as a `1x1x1` convolution alone, a convolution is always related to the depth of the input volume. In general, the architecture of a convolution of this kind is : `WxHxD` -> `(1x1xD)x1` -> `WxHx1`. And you combined `D` input features into 1 feature. But if the input volume have `D=1`, so you're combining 1 feature into another feature. You're simply passing the feature value to a neuron that will map this single value into a different space. It could be useful in some cases I guess – nessuno Sep 07 '16 at 13:55
  • 2
    @Verych You're correct. Mathematically you could define a 1x1x1 convolution, and it would indeed be useless (output would equal the original input). For some reason, in machine learning people often assume there's a 3rd dimension which is number of channels (or number of filters). So, implicitly, "1x1 convolution" actually refers to "1x1xD convolution". – MD004 Sep 29 '17 at 19:44
  • Clarifying link: https://www.quora.com/Is-a-fully-connected-neural-network-conceptually-similar-to-a-1x1-convolutional-neural-network – fast-reflexes Sep 27 '18 at 06:55
  • The output of a fully connected network is a vector, but the NiN's output is still a matrix. Why is it 'mathematically equivalent to a fully connected layer'? I googled a lot, but I can't understand this equivalence. Is there any intuitive explanation for it? – Alex Luya Sep 14 '19 at 09:26
  • A more relevant question could be: what happens with a 1*1 conv where the number of input channels equals the number of output channels? It seems useless, yet I have seen it being done. Possibly the activations bring in non-linearity and thus help? – Allohvk Jul 01 '21 at 05:28
  • When the number of input channels (D) is equal to the number of output channels (D), the operation is still useful. You haven't reduced the dimensionality, but you've learned how to extract D feature maps from an input with depth D, looking only at single pixels. So you extracted a new representation of the depth information (e.g. you combined somehow the RGB components of the input pixel into a new 3D vector with other, hopefully meaningful, info for solving the task); see the sketch below. – nessuno Jul 01 '21 at 08:57
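A minimal illustration of that last point (a PyTorch sketch, not code from the thread): a 1x1 convolution with as many output channels as input channels keeps the shape but learns a new per-pixel mixing of the channels.

```python
import torch
import torch.nn as nn

# 3 channels in, 3 channels out: no dimensionality reduction, but each output
# pixel is a learned linear combination of that pixel's R, G and B values
remix = nn.Conv2d(3, 3, kernel_size=1)

x = torch.randn(1, 3, 100, 100)
print(remix(x).shape)  # torch.Size([1, 3, 100, 100]): same shape, new mixture
```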

A 1x1 convolution simply maps an input pixel to an output pixel, without looking at anything around it. It is often used to reduce the number of depth channels, since it is often very slow to multiply volumes with extremely large depths.

input (256 depth) -> 1x1 convolution (64 depth) -> 4x4 convolution (256 depth)

input (256 depth) -> 4x4 convolution (256 depth)

The bottom one is about 3.7x slower.
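That factor follows from counting multiply-accumulates per output pixel (ignoring biases):

```python
# Multiply-accumulates per output pixel for each pipeline
bottleneck = 1 * 1 * 256 * 64 + 4 * 4 * 64 * 256  # 16384 + 262144 = 278528
direct     = 4 * 4 * 256 * 256                    # 1048576
print(direct / bottleneck)                        # ~3.76
```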

Theoretically, the neural network can 'choose' which input 'colors' to look at this way, instead of brute-force multiplying everything.

– Free Debreuil