You could check whether this paper helps.
They say that if you don't use normalization, you need to train "more carefully", i.e. with a lower learning rate.
Skimming the first pages, I could imagine it works like this:
For some nonlinearities (e.g. sigmoid or tanh), there's a 'good' input range, and batch norm pulls the pre-activations into that range. Inputs with a large magnitude are bad: they push the function into its saturated regions, where the slope is nearly zero and you get "vanishing gradients".
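Here's a toy NumPy sketch of that saturation argument. The pre-activation values are made up, and the "batch norm" step is just mean/variance standardization without the learned scale and shift parameters:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)

# Made-up pre-activations: one set in the 'good' range, one far out in the tails.
small = np.array([-1.0, 0.0, 1.0])
large = np.array([-8.0, 10.0, 12.0])

print(sigmoid_grad(small))  # roughly 0.20-0.25: plenty of slope to learn from
print(sigmoid_grad(large))  # all under ~1e-3: saturated, gradients effectively vanish

# Standardizing the batch pulls the large values back into the high-slope region
# (ignoring batch norm's learned scale/shift here).
normalized = (large - large.mean()) / (large.std() + 1e-5)
print(sigmoid_grad(normalized))  # back to roughly 0.15-0.25
```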
So, if you don't normalize, you need to take smaller steps - a lower learning rate - to avoid 'jumping' to weights that produce such large values inside the net, and you also need to be more careful about how you initialize the weights. I'd guess that with ReLUs this is less of a problem, since they don't saturate for positive inputs (see the sketch below). But please correct me if someone else has had different experiences with ReLUs.
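A quick sketch of both points - again purely illustrative: the layer sizes are arbitrary, and I'm using He-style scaling just as one example of "careful" initialization:

```python
import numpy as np

def relu_grad(x):
    # ReLU'(x) is 1 for x > 0 and 0 otherwise: no saturation for large positive inputs.
    return (x > 0).astype(float)

print(relu_grad(np.array([0.5, 10.0, 1000.0])))     # [1. 1. 1.] -- slope never shrinks
print(relu_grad(np.array([-0.5, -10.0, -1000.0])))  # [0. 0. 0.] -- but negative units go 'dead'

# 'Careful' initialization: scale the weights by 1/sqrt(fan_in) so the pre-activation
# variance stays O(1) per layer instead of growing with the layer width.
fan_in = 256
W = np.random.randn(fan_in, 128) * np.sqrt(2.0 / fan_in)  # He-style scaling for ReLU
x = np.random.randn(64, fan_in)                            # a dummy input batch
print(np.var(x @ W))                                       # ~2, independent of fan_in
```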