Out of curiosity, I looked at how the Adam optimizer is implemented in PyTorch and TensorFlow (TF). I may be wrong, but it seems to me that the two implementations differ, and that the PyTorch one follows the original paper, https://arxiv.org/pdf/1412.6980.pdf.
My problem concerns the eps parameter. The TF implementation seems to give it a time- and $b_2$-dependent effective value, namely
$$q_{t+1} = q_t - \gamma\,\frac{\sqrt{1-b_2^t}}{1-b_1^t}\,\frac{m_t}{\sqrt{v_t}+\epsilon},$$
which, in the notation of the original algorithm, can be reformulated as
$$q_{t+1} = q_t - \gamma\,\frac{\hat{m}_t}{\sqrt{\hat{v}_t}+\epsilon/\sqrt{1-b_2^t}}.$$
This makes the effective eps vary with $t$ (through $b_2$), which is the case neither in the original algorithm nor in the PyTorch implementation.
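To make the comparison concrete, here is a small NumPy sketch of a single Adam step written both ways. This is my own reconstruction of the update rules for illustration, not the actual TF or PyTorch source, and all the constants and moment values are made up:

```python
import numpy as np

# Illustrative values only (not taken from any real training run)
gamma, b1, b2, eps = 1e-3, 0.9, 0.999, 1e-8
t = 10            # current step (1-indexed)
m, v = 0.02, 5e-4 # first/second moment estimates at step t
q = 1.0           # parameter value

# (1) Original paper / PyTorch-style: bias-correct m and v,
#     then add eps to sqrt(vhat)
mhat, vhat = m / (1 - b1**t), v / (1 - b2**t)
q_paper = q - gamma * mhat / (np.sqrt(vhat) + eps)

# (2) TF-style: fold the bias corrections into the step size,
#     then add eps to sqrt(v) (the *uncorrected* second moment)
lr_t = gamma * np.sqrt(1 - b2**t) / (1 - b1**t)
q_tf = q - lr_t * m / (np.sqrt(v) + eps)

# (3) Rewriting (2) in the paper's notation exposes the effective epsilon:
#     eps_eff = eps / sqrt(1 - b2**t), which decreases toward eps as t grows
eps_eff = eps / np.sqrt(1 - b2**t)
q_tf_rewritten = q - gamma * mhat / (np.sqrt(vhat) + eps_eff)

print(q_paper, q_tf, q_tf_rewritten)
```

With these values, (2) and (3) agree to machine precision, while (1) differs slightly, which is exactly the eps variation described above.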
Am I wrong, or is this a known difference? Thanks for your help.