
Using PyTorch and TensorFlow (TF), I was curious how the Adam optimizer is implemented. I may be wrong, but it seems to me that the two implementations differ, and that the PyTorch one follows the original algorithm from https://arxiv.org/pdf/1412.6980.pdf.

My problem comes from the eps parameter. The TF implementation seems to lead to a time- and b2-dependence of this parameter, namely

q(t+1) = q(t) - gamma * sqrt(1 - b2^t)/(1 - b1^t) * m(t) / (sqrt(v(t)) + eps)

which, in the original algorithm's notation, can be reformulated as

q(t+1) = q(t) - gamma * mhat(t) / (sqrt(vhat(t)) + eps/sqrt(1 - b2^t))

This points out a time variation of the eps parameter, which happens neither in the original algorithm nor in the PyTorch implementation (see the sketch below).
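
To make the comparison concrete, here is a minimal NumPy sketch of the two update rules as I understand them (my own toy code with made-up names, not the actual PyTorch or TF implementations):

```python
import numpy as np

# Hypothetical toy settings, just for illustration.
gamma, b1, b2, eps = 1e-3, 0.9, 0.999, 1e-8

def adam_algorithm1_step(q, m, v, g, t):
    # Algorithm 1 of Kingma & Ba: bias-correct m and v, keep eps fixed.
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g**2
    m_hat = m / (1 - b1**t)
    v_hat = v / (1 - b2**t)
    q = q - gamma * m_hat / (np.sqrt(v_hat) + eps)
    return q, m, v

def adam_tf_style_step(q, m, v, g, t):
    # Formulation with the bias correction folded into the step size;
    # eps is added to the *uncorrected* sqrt(v), as in the TF docs.
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g**2
    lr_t = gamma * np.sqrt(1 - b2**t) / (1 - b1**t)
    q = q - lr_t * m / (np.sqrt(v) + eps)
    return q, m, v

# Minimise f(q) = q^2 with both rules; the parameters drift apart slightly
# because eps enters the two formulas in different places.
q1 = q2 = 1.0
m1 = v1 = m2 = v2 = 0.0
for t in range(1, 6):
    q1, m1, v1 = adam_algorithm1_step(q1, m1, v1, 2 * q1, t)
    q2, m2, v2 = adam_tf_style_step(q2, m2, v2, 2 * q2, t)
    print(t, q1, q2, q1 - q2)
```

With eps = 0 the two rules are algebraically identical; the whole difference lies in whether eps sits next to the bias-corrected or the raw second moment.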

Am I wrong, or is this well known? Thanks for your help.

Jean-Eric

2 Answers


You can derive the first formula from the second as follows. In the TensorFlow implementation, the epsilon is actually the epsilon' here, and the learning rate is adjusted to alpha' in the following formula. Hope this helps.

[Image: derivation relating the two formulations]
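
The image is not reproduced here; the derivation it presumably shows runs roughly as follows (my reconstruction in the paper's notation, starting from Algorithm 1):

```latex
\begin{aligned}
\theta_t
  &= \theta_{t-1} - \alpha\,\frac{\hat m_t}{\sqrt{\hat v_t}+\epsilon},
     \qquad \hat m_t = \frac{m_t}{1-\beta_1^t},\quad
            \hat v_t = \frac{v_t}{1-\beta_2^t}\\[4pt]
  &= \theta_{t-1} - \frac{\alpha}{1-\beta_1^t}\,
     \frac{m_t}{\sqrt{v_t/(1-\beta_2^t)}+\epsilon}\\[4pt]
  &= \theta_{t-1}
     - \underbrace{\alpha\,\frac{\sqrt{1-\beta_2^t}}{1-\beta_1^t}}_{\alpha'}\;
       \frac{m_t}{\sqrt{v_t}+\underbrace{\epsilon\,\sqrt{1-\beta_2^t}}_{\epsilon'}}
\end{aligned}
```

Reading it backwards: holding epsilon' fixed (as TF does) corresponds to an Algorithm-1 epsilon of epsilon'/sqrt(1 - b2^t), which is exactly the time dependence noted in the question.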

zihaozhihao

Indeed, you can check this in the docs for the TF Adam optimizer. To quote the relevant part:

The default value of 1e-8 for epsilon might not be a good default in general. For example, when training an Inception network on ImageNet a current good choice is 1.0 or 0.1. Note that since AdamOptimizer uses the formulation just before Section 2.1 of the Kingma and Ba paper rather than the formulation in Algorithm 1, the "epsilon" referred to here is "epsilon hat" in the paper.

If you check "the formulation just before Section 2.1" in the paper, they actually include the time dependence in alpha, resulting in a time-dependent "step size" alpha_t but a fixed epsilon. Note that, at the end of the day, this is just rewriting/interpreting the parameters in a slightly different fashion and doesn't change the actual workings of the algorithm. But you need to be aware that choosing the same epsilon in the PyTorch and TF implementations will apparently not lead to the same results...
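
To get a feel for the size of that mismatch, here is a quick back-of-the-envelope snippet (my own illustration, not library code): a fixed TF-style epsilon corresponds to an effective Algorithm-1 epsilon of eps / sqrt(1 - b2^t), which is much larger early in training.

```python
# Effective Algorithm-1 epsilon implied by a constant TF-style epsilon.
b2 = 0.999
eps_tf = 1e-8
for t in (1, 10, 100, 1000, 10000):
    eps_alg1 = eps_tf / (1 - b2 ** t) ** 0.5
    print(f"t={t:>6}  effective eps = {eps_alg1:.3e}")
```

The two values only agree once 1 - b2^t is close to 1, i.e. after a few thousand steps with the default b2 = 0.999.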

xdurch0
  • "If you check the "the formulation just before section 2.1" in the paper, they actually include the time dependence in alpha, resulting in a time-dependent "step size" alpha_t but a fixed epsilon." I don't think this is correct. As I understand it, the second formulation (just before section 2.1) uses epsilon hat, which, by my math, is the epsilon prime in zihaozhihao's answer above (and is not fixed). – MindSeeker Mar 11 '22 at 23:49