
I have been using the following piece of code to print the lr_t learning rate of the Adam() optimizer for my trainable_model.

import numpy as np
import tensorflow as tf
from keras import backend as K

# print lr_t on roughly 3% of the calls, and only while training
if np.random.uniform() * 100 < 3 and self.training:
    model = self.trainable_model
    _lr    = tf.to_float(model.optimizer.lr, name='ToFloat')
    _decay = tf.to_float(model.optimizer.decay, name='ToFloat')  # 0 by default, unused below
    _beta1 = tf.to_float(model.optimizer.beta_1, name='ToFloat')
    _beta2 = tf.to_float(model.optimizer.beta_2, name='ToFloat')
    _iterations = tf.to_float(model.optimizer.iterations, name='ToFloat')
    t = K.cast(_iterations, K.floatx()) + 1
    # bias-corrected learning rate, as computed in Keras' Adam
    _lr_t = _lr * (K.sqrt(1. - K.pow(_beta2, t)) / (1. - K.pow(_beta1, t)))
    print(" - LR_T: " + str(K.eval(_lr_t)))

What I don't understand is that this learning rate increases (with decay at its default value of 0).

If we look at the learning_rate equation in Adam, we find this:

 lr_t = lr * (K.sqrt(1. - K.pow(self.beta_2, t)) /
                 (1. - K.pow(self.beta_1, t)))

which, with the default parameter values (lr = 0.001, beta_1 = 0.9, beta_2 = 0.999), corresponds to:

lr_t = 0.001 * sqrt(1 - 0.999^t) / (1 - 0.9^t)

If we plot this expression we obtain the following: [plot of lr_t over time; original image not shown]

which clearly shows that the learning rate is increasing over time (since t starts at 1).
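
For reference, here is a minimal standalone sketch (separate from my training code) that evaluates this factor for a few values of t:

import numpy as np

lr, beta_1, beta_2 = 0.001, 0.9, 0.999
for t in [1, 10, 100, 1000, 10000]:
    # the bias-correction factor tends to 1 as t grows, so lr_t approaches the base lr
    lr_t = lr * np.sqrt(1. - beta_2 ** t) / (1. - beta_1 ** t)
    print(t, lr_t)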

Can someone explain why this is the case? Everything I have read says we should use a learning rate that decays over time, not one that increases.

Does it mean that my neural network makes bigger updates over time as Adam's learning rate increases?

Zhell
  • These equations for the learning rate are incomplete; you are not considering the division by the running mean of the squared gradient. – Dr. Snoopy Jun 04 '19 at 08:35
  • Do you mean that after doing this division the actual learning rate may be decreasing? – Zhell Jun 04 '19 at 08:55
  • No, I mean that your equations are incorrect, so you are drawing incorrect conclusions. – Dr. Snoopy Jun 04 '19 at 09:00
  • This equation is taken straight from Keras, so I don't think it is incorrect, but maybe it is incomplete for what you are talking about. My "conclusion" is that the learning rate increases; if that is incorrect, it would imply that the learning rate decreases, yet you tell me that this is not what you mean, so I don't get it. Can you explain a bit more, please? – Zhell Jun 04 '19 at 09:14
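
For context, the division mentioned in the comments above is part of the full Adam parameter update, not of lr_t alone. A minimal NumPy sketch of the standard Adam step as Keras implements it (illustrative names, not the code from the question):

import numpy as np

def adam_step(p, g, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-7):
    # running mean of the gradient
    m = beta1 * m + (1. - beta1) * g
    # running mean of the squared gradient
    v = beta2 * v + (1. - beta2) * g ** 2
    # the bias-correction factor discussed in the question
    lr_t = lr * np.sqrt(1. - beta2 ** t) / (1. - beta1 ** t)
    # the actual update also divides by sqrt(v), not just by a learning rate
    p = p - lr_t * m / (np.sqrt(v) + eps)
    return p, m, v

Because the step divides by sqrt(v), the size of each parameter update depends on the gradient history as well, so lr_t alone does not tell you how large the updates are.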

1 Answer


Looking at the source code of the Adam optimizer in Keras, the actual "decay" is performed at this line. The code you quoted is executed only after that and is not the decay itself.
If the question is "why is it like that?", I would suggest reading some theory about Adam, for example the original paper.

EDIT
It should be clear that the update equation of the Adam optimizer does NOT include a decay by itself. The decay should be applied separately.
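
Roughly, what Keras does can be sketched like this (a simplified sketch with illustrative names, not the actual Keras source; t is the 1-based timestep from your code):

import numpy as np

def effective_lr(t, base_lr=0.001, decay=0.0, beta_1=0.9, beta_2=0.999):
    lr = base_lr
    if decay > 0.:
        # the separate, optional decay: an inverse-time rescaling of the base rate
        lr = lr * (1. / (1. + decay * (t - 1)))  # Keras uses iterations, i.e. t - 1
    # followed by the bias correction shown in the question
    return lr * np.sqrt(1. - beta_2 ** t) / (1. - beta_1 ** t)

With decay left at 0.0 (the default) the rescaling does nothing, so all you see is the increasing bias-correction factor.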

marco romelli
  • I am using default values, so self.initial_decay=0, which means this line is not used. Yet I was told that Adam already does some form of learning_rate decay even if you give it the default parameters. I edited my post to make this point clearer. – Zhell Jun 04 '19 at 08:57
  • Exactly, you are not using the decay so the learning rate doesn't decay... – marco romelli Jun 04 '19 at 09:33
  • Ok, I see why this learning rate doesn't decay. But shouldn't the updates to the parameters still decay because of the moments being taken into account by Adam? – Zhell Jun 04 '19 at 10:06
  • It's difficult to say, since the updates are different for every parameter, but it doesn't seem true to me in general. You can try to plot some of the updates over time and see how it behaves. – marco romelli Jun 04 '19 at 10:14
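
Following that suggestion, a minimal sketch of how the size of the weight updates could be logged per batch (assumes a compiled Keras model named model and NumPy batches x_batch, y_batch; these names are placeholders, not from the question):

import numpy as np

def update_magnitude(model, x_batch, y_batch):
    # snapshot the weights, train on one batch, and measure how much they moved
    before = [w.copy() for w in model.get_weights()]
    model.train_on_batch(x_batch, y_batch)
    after = model.get_weights()
    return np.sqrt(sum(np.sum((a - b) ** 2) for a, b in zip(after, before)))

# collect one value per batch and plot the resulting list, e.g.:
# magnitudes = [update_magnitude(model, xb, yb) for xb, yb in batches]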