This question is essentially about the inner workings of Keras / tf.keras, aimed at people with very deep knowledge of the framework.

To my knowledge, tf.keras.optimizers.Adam is an optimizer that already has an adaptive learning-rate scheme. So if we use keras.callbacks.ReduceLROnPlateau with the Adam optimizer (or any other adaptive one), isn't doing so meaningless? I don't know the inner workings of Keras optimizers, but it seems natural to ask: if we are already using an adaptive optimizer, why use this callback at all, and if we do use it, what effect does it have on training?
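
For reference, here is a minimal sketch of the setup I'm asking about (the model and data below are placeholders, not my actual code):

```python
import tensorflow as tf

# Placeholder model -- any Keras model would do.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(10,)),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
              loss="mse")

# ReduceLROnPlateau shrinks the optimizer's learning rate when val_loss
# stops improving -- this is the combination the question asks about.
reduce_lr = tf.keras.callbacks.ReduceLROnPlateau(
    monitor="val_loss", factor=0.5, patience=3, min_lr=1e-6)

# model.fit(x_train, y_train, validation_data=(x_val, y_val),
#           callbacks=[reduce_lr], epochs=50)
```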

  • Doesn't seem to actually be a *programming* question - after all, [it works with Adam](https://stackoverflow.com/questions/52134926/reducelronplateau-gives-error-with-adam-optimizer). You seem to ask a *theoretical* question. – desertnaut Feb 23 '21 at 14:21
  • Adam still has a "default" learning rate that is simply scaled by all the adaptive thingamajigs, so it seems "obvious" to me that it will affect the learning. If you wonder _how_ it affects it, why not try it out? – xdurch0 Feb 23 '21 at 14:40
  • 4
    I use the Adam optimizer along with ReduceLROnPlateau and it works fine. Not sure on what basis the Adam optimizer adjust the learning rate if in fact it does but what you want is to reduce the learning rate based on the validation loss in general. – Gerry P Feb 23 '21 at 16:41
  • @desertnaut Well, I thought someone from the `Keras` team might help clarify whether there is any use to it, because some things are built for general purposes and need not be used together even though you CAN. – Deshwal Feb 24 '21 at 05:47
  • @xdurch0 So if I use `AdaGrad`, would it make sense now? It has a different learning rate for different parameters. So what now? Can someone please point me towards a link or so? – Deshwal Feb 24 '21 at 05:48
  • 1
    AdaGrad *still* has a global learning rate. Yes, every parameter has "a different" learning rate but these are all _based on_ a global learning rate. Essentially, `learning_rate(param) = global_learning_rate * adaptive_terms(param)`. Changing the learning rate in the Keras optimizers modifies this global learning rate, which acts as a scale for all the per-parameter learning rates. – xdurch0 Feb 24 '21 at 10:30
  • @xdurch0 Thanks a lot. I got the idea. So `ReduceLR` is not hurting and is there for a reason. Thanks a lot. – Deshwal Feb 25 '21 at 11:24
  • 2
    To add one more thing, one optimizer where you'd actually be right that there is _no_ global learning rate is Adadelta. However, here the Keras people simply added this in the implementation even though it's not in the paper. Generally, I can confirm from many experiments that reducing LR on plateau can help _a lot_ even with adaptive optimizers like Adam. Give it a try! – xdurch0 Feb 25 '21 at 14:01
  • Yeah. I have been using `Adam` with `ReduceLR`, but one day it just struck me to wonder whether it was even helping, even though I could see the changes. Thanks for your insight and help. – Deshwal Feb 26 '21 at 03:17
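
To make xdurch0's point concrete, here is a small sketch (plain NumPy, not Keras source code) of a bare-bones AdaGrad update. Every per-parameter step is the product of a per-parameter adaptive term and one global learning rate, and that global rate is the knob that `ReduceLROnPlateau` turns:

```python
import numpy as np

def adagrad_step(params, grads, accum, global_lr, eps=1e-8):
    accum = accum + grads ** 2                 # running sum of squared grads
    adaptive = 1.0 / (np.sqrt(accum) + eps)    # per-parameter adaptive term
    # step(param) = global_lr * adaptive_term(param) * grad(param)
    return params - global_lr * adaptive * grads, accum

params, accum = np.array([1.0, 1.0]), np.zeros(2)
grads = np.array([0.1, 2.0])
params, accum = adagrad_step(params, grads, accum, global_lr=0.1)
# Halving global_lr halves every per-parameter update, no matter what the
# adaptive terms are doing.
```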

1 Answer


Conceptually, consider the gradient to be a fixed mathematical value produced by automatic differentiation.

What every optimizer other than pure SGD does is take the gradient and apply some statistical analysis to create a better gradient. In the simplest case, momentum, the gradient is averaged with previous gradients. In RMSProp, the variance of the gradient across batches is measured: the noisier it is, the less RMSProp "trusts" the gradient, and so the gradient is scaled down (roughly, divided by the running RMS of the gradient for that weight). Adam does both.

Then, all optimizers multiply the statistically adjusted gradient by a learning rate.
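
Here's a bare-bones sketch of the textbook Adam update (plain NumPy, not the Keras implementation) to make the two stages explicit:

```python
import numpy as np

def adam_step(param, grad, m, v, t, lr, beta1=0.9, beta2=0.999, eps=1e-8):
    # Stage 1: statistically adjust the raw gradient.
    m = beta1 * m + (1 - beta1) * grad        # momentum: running mean
    v = beta2 * v + (1 - beta2) * grad ** 2   # RMS: running mean of squares
    m_hat = m / (1 - beta1 ** t)              # bias correction for early steps
    v_hat = v / (1 - beta2 ** t)
    adjusted_grad = m_hat / (np.sqrt(v_hat) + eps)
    # Stage 2: multiply by the learning rate -- still your decision.
    return param - lr * adjusted_grad, m, v

param, m, v = 1.0, 0.0, 0.0
for t in range(1, 4):
    param, m, v = adam_step(param, grad=0.5, m=m, v=v, t=t, lr=1e-3)
```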

So although one colloquial description of Adam is that it automatically tunes a learning rate... a more informative description is that Adam statistically adjusts gradients to be more reliable, but you still need to decide on a learning rate and how it changes during training (i.e., an LR policy). ReduceLROnPlateau, cosine decay, warmup, etc. are examples of LR policies.
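
For instance, in tf.keras a plateau policy is applied via the ReduceLROnPlateau callback (as in the question), while a schedule like cosine decay can be passed directly to the optimizer. A minimal sketch (the decay_steps value here is an arbitrary choice):

```python
import tensorflow as tf

# The LR policy: cosine decay from 1e-3 down over 10,000 training steps.
schedule = tf.keras.optimizers.schedules.CosineDecay(
    initial_learning_rate=1e-3, decay_steps=10_000)

# Adam still does its statistical gradient adjustment; the schedule only
# controls the global learning rate it multiplies by.
optimizer = tf.keras.optimizers.Adam(learning_rate=schedule)
```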

Whether you program TF or PyTorch, the pseudocode in PyTorch's optimizer docs is my go-to for understanding the optimizer algorithms. It looks like a wall of Greek letters at first, but you'll grok it if you stare at it for a few minutes.

https://pytorch.org/docs/stable/optim.html

Yaoshiang
  • 1,713
  • 5
  • 15
  • Thanks. However, the question seems to be: as Adam corrects the learning rate, so do ReduceLROnPlateau and LR schedulers/decay etc., so would it be overdoing it to use such a thing along with Adam? – user2458922 Dec 26 '22 at 23:47
  • 1
    Adjusting LR is still necessary even with an adaptive optimizer like ADAM. – Yaoshiang Dec 30 '22 at 23:42