
In Keras/TensorFlow, `clipnorm` rescales a gradient whose norm exceeds a given threshold down to that norm, and `clipvalue` clips every component of the gradient to a bounded interval.
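
For reference, here is a minimal NumPy sketch of how I understand the two operations (the function names are my own, not the Keras internals):

```python
import numpy as np

def clipnorm(g, max_norm):
    # Rescale g to norm max_norm if its L2 norm exceeds max_norm;
    # otherwise leave it unchanged.
    norm = np.linalg.norm(g)
    return g * (max_norm / norm) if norm > max_norm else g

def clipvalue(g, max_value):
    # Clip every component of g to the interval [-max_value, max_value].
    return np.clip(g, -max_value, max_value)
```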

But what happens if you combine one of them with momentum, or with an optimizer like Adam? Is the clipping applied to the gradient or to the velocity?

A) Is clipnorm applied to the actual mathematical gradient g of the loss with respect to the parameters, and this clipped gradient is then combined with the momentum of the old gradients and the learning rate to form the update step?

```
velocity = momentum * velocity - learning_rate * clipnorm(g)
w = w + velocity
```

or

B) First the momentum of the old gradients is combined with the unmodified new gradient; then the resulting vector (the "velocity") is rescaled by clipnorm.

```
velocity = clipnorm(momentum * velocity - learning_rate * g)
w = w + velocity
```

or B')

```
velocity = momentum * velocity - learning_rate * g
w = w + clipnorm(velocity)
```

or there would also be the possibility of A')

```
velocity = momentum * velocity - clipnorm(learning_rate * g)
w = w + velocity
```

?

A (and A') would suffer from the problem that, even though the norm of each incoming gradient is bounded, the velocity itself can still grow well beyond that bound through momentum, and the clipping would make it even slower to dampen the velocity or change its direction.
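
A toy simulation of variant A makes this concrete: with a constant gradient, each clipped gradient has norm at most 1, yet the velocity saturates at roughly learning_rate * max_norm / (1 - momentum), far above the clip threshold (the numbers here are made up):

```python
import numpy as np

momentum, learning_rate, max_norm = 0.99, 0.1, 1.0
g = np.array([5.0, 0.0])                        # constant raw gradient, norm 5
g_clipped = g * (max_norm / np.linalg.norm(g))  # norm 1 after clipping

velocity = np.zeros_like(g)
for _ in range(1000):
    velocity = momentum * velocity - learning_rate * g_clipped

# The velocity norm approaches lr * max_norm / (1 - momentum) = 10,
# ten times the clip threshold.
print(np.linalg.norm(velocity))
```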

From my perspective B would be the most reasonable, but I don't know how it is actually implemented.

The same question can be asked analogously for clipvalue, and for Adam and other momentum-based algorithms.
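
To make the analogous question for Adam concrete, here is a sketch of the standard Adam step (following the original paper's update rule), with comments marking the two candidate places where clipping could act; the hyperparameter values are just the usual defaults:

```python
import numpy as np

def clipvalue(x, c):
    return np.clip(x, -c, c)

def adam_step(w, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-7, c=0.5):
    g = clipvalue(g, c)                   # variant A: clip the raw gradient
    m = beta1 * m + (1 - beta1) * g       # first-moment (momentum) estimate
    v = beta2 * v + (1 - beta2) * g ** 2  # second-moment estimate
    m_hat = m / (1 - beta1 ** t)          # bias correction
    v_hat = v / (1 - beta2 ** t)
    update = -lr * m_hat / (np.sqrt(v_hat) + eps)
    # variant B' would instead clip the final step:
    # update = clipvalue(update, c)
    return w + update, m, v
```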

PS: If clipnorm is not implemented as suggested in B, I would be interested to know whether there is also a way to get B or B' in Keras by using a different option.
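
One workaround I could imagine (not a built-in option, just a sketch assuming `model`, `dataset`, and `loss_fn` are defined elsewhere) is a custom training loop that maintains the velocity itself and clips it, i.e. variant B':

```python
import tensorflow as tf

learning_rate, momentum, clip_norm = 0.01, 0.9, 1.0
velocities = [tf.Variable(tf.zeros_like(w)) for w in model.trainable_variables]

for x, y in dataset:
    with tf.GradientTape() as tape:
        loss = loss_fn(y, model(x, training=True))
    grads = tape.gradient(loss, model.trainable_variables)
    for w, v, g in zip(model.trainable_variables, velocities, grads):
        v.assign(momentum * v - learning_rate * g)   # unclipped gradient enters the velocity
        w.assign_add(tf.clip_by_norm(v, clip_norm))  # clip the velocity itself (B')
```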

Jakob
    Usually for these types of inquiries, the best way to find the answer is to look at the code. Have a look at https://github.com/tensorflow/tensorflow/blob/813ef350aa2858a8764d318ef521d97dd62c5859/tensorflow/python/keras/optimizer_v2/optimizer_v2.py#L382 (Optimizer_v2 class) which is the super class of SGD to see the order in which these operations are done. – Ahmad Baracat Oct 18 '20 at 12:19
  • Thanks, but I am not able to get the answer to my question from that source code (I don't see where and how the momentum is treated). It would be extremely helpful if someone who is more familiar with the source code could just say if the answer to my question is A, A', B or B' (or something different). – Jakob Oct 18 '20 at 19:22
  • I think it is rather A than B? I.e. it is NOT applied to the velocity? Is this correct? Additionally, things get messier because apparently it is not applied to the complete gradient, but only to the gradient of each tensor separately (`global_clipnorm` is applied to the complete gradient). I would be very interested if someone could confirm this. And I would be very interested to hear if it is also possible to prevent the velocity from getting too large in momentum-based algorithms (as proposed in B)? – Jakob Oct 21 '20 at 14:00
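
To illustrate the per-tensor vs. global distinction from the last comment, here is a small sketch using TensorFlow's public clipping ops (the gradient values are made up):

```python
import tensorflow as tf

grads = [tf.constant([3.0, 4.0]), tf.constant([12.0])]  # norms 5 and 12

# Per-tensor clipping (what `clipnorm` apparently does): each gradient
# tensor is rescaled to norm <= 1 independently of the others.
per_tensor = [tf.clip_by_norm(g, 1.0) for g in grads]

# Global clipping (`global_clipnorm`): all tensors are rescaled by one
# common factor so that their joint norm, sqrt(5**2 + 12**2) = 13,
# becomes <= 1.
global_clipped, global_norm = tf.clip_by_global_norm(grads, 1.0)
```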

0 Answers