A common initializer for sigmoid-based networks is the Xavier initializer (a.k.a. the Glorot initializer), named after Xavier Glorot, one of the authors of the paper "Understanding the difficulty of training deep feedforward neural networks". The formula takes into account not only the number of incoming connections, but the number of outgoing ones as well. The authors show that with this initialization, the variance of the activations stays roughly constant across layers, which helps gradient flow in the backward pass.
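As a rough sketch of the idea (plain NumPy, with a hypothetical helper name), the uniform variant of the Xavier formula looks like this:

```python
import numpy as np

def xavier_uniform(fan_in, fan_out):
    # Glorot/Xavier uniform: W ~ U(-limit, limit) with
    # limit = sqrt(6 / (fan_in + fan_out)), which gives
    # Var(W) = 2 / (fan_in + fan_out).
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return np.random.uniform(-limit, limit, size=(fan_in, fan_out))
```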
For ReLU-based networks, a better choice is the He initializer from "Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification" by Kaiming He et al., which establishes the same properties for the ReLU activation.
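A minimal sketch of the normal variant of He initialization, again in plain NumPy with a hypothetical helper name:

```python
import numpy as np

def he_normal(fan_in, fan_out):
    # He initialization: W ~ N(0, sqrt(2 / fan_in)); the factor of 2
    # compensates for ReLU zeroing out roughly half of the activations.
    std = np.sqrt(2.0 / fan_in)
    return np.random.normal(0.0, std, size=(fan_in, fan_out))
```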
Dense and convolutional layers aren't that different in this respect, but it's important to remember that kernel weights are shared across spatial positions of the input and across the batch, so the number of incoming connections per output unit is determined by the kernel size and the number of input channels, and might not be obvious at first glance.
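For illustration (a hypothetical helper, not a library function), the fan-in of a convolutional layer can be computed as:

```python
def conv_fan_in(kernel_h, kernel_w, in_channels):
    # Each output unit of a conv layer is connected to a
    # kernel_h x kernel_w patch across all input channels,
    # regardless of image size or batch size (weights are shared).
    return kernel_h * kernel_w * in_channels

# e.g. a 3x3 kernel over 64 input channels: 3 * 3 * 64 = 576 incoming connections
print(conv_fan_in(3, 3, 64))  # 576
```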
In TensorFlow, He initialization is implemented in the variance_scaling_initializer() function (which is, in fact, a more general initializer, but performs He initialization by default), while the Xavier initializer is, logically, xavier_initializer().
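A usage sketch, assuming TensorFlow 1.x, where both initializers live under tf.contrib.layers:

```python
import tensorflow as tf

# He initialization: variance_scaling_initializer() defaults to
# factor=2.0, mode='FAN_IN', i.e. Var(W) = 2 / fan_in.
he_init = tf.contrib.layers.variance_scaling_initializer()
xavier_init = tf.contrib.layers.xavier_initializer()

w_relu = tf.get_variable("w_relu", shape=[784, 256], initializer=he_init)
w_sigmoid = tf.get_variable("w_sigmoid", shape=[784, 256], initializer=xavier_init)
```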
See also this discussion on CrossValidated.