
[Screenshot of the relevant passage from the original paper.]

I understand the paper to be saying that when the dot-product values are large, the gradient of the softmax becomes very small.
However, when I tried to work out the gradient of the softmax combined with the cross-entropy loss, I found that the gradient is not determined by the size of the values passed to the softmax alone.
Even if a single value is large, it can still receive a large gradient when the other values are also large. (Sorry, I don't know how to typeset the calculation here, so I have written it out as a rough sketch below.)
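
Roughly, the calculation I have in mind looks like this (a small NumPy sketch; the specific numbers are only an example):

    import numpy as np

    def softmax(x):
        e = np.exp(x - x.max())        # subtract max for numerical stability
        return e / e.sum()

    x = np.array([50.0, 50.0, 50.0])   # all logits large, but of the same magnitude
    y = np.array([1.0, 0.0, 0.0])      # one-hot target

    p = softmax(x)                     # [1/3, 1/3, 1/3] -- far from saturated
    grad = p - y                       # gradient of cross-entropy w.r.t. the logits
    print(p)                           # [0.3333 0.3333 0.3333]
    print(grad)                        # [-0.6667  0.3333  0.3333] -- not small at all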


1 Answer

Actually, the gradient of the cross-entropy loss with softmax on a one-hot encoded target is just softmax(x) - 1 at the index of the correct class (i.e. its magnitude is 1 - softmax(x)); see https://eli.thegreenplace.net/2016/the-softmax-function-and-its-derivative/. If the value passed to the softmax is large relative to the other values, the softmax produces 1 at that index and therefore produces a 0 gradient.
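
As a quick sanity check of that formula, here is a small NumPy sketch (the specific logits are just an example): once one value dominates, the softmax saturates at 1 and the gradient softmax(x) - y becomes vanishingly small.

    import numpy as np

    def softmax(x):
        e = np.exp(x - x.max())        # subtract max for numerical stability
        return e / e.sum()

    x = np.array([40.0, 30.0, 20.0])   # one logit clearly dominates
    y = np.array([1.0, 0.0, 0.0])      # one-hot target at the dominating index

    p = softmax(x)                     # ~[1.0, 4.5e-05, 2.1e-09]
    grad = p - y                       # ~[-4.5e-05, 4.5e-05, 2.1e-09] -- essentially zero
    print(p, grad)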

Thomas Pinetz
  • Hi, I may not have expressed myself clearly in the question. The point is that the value X passed to the softmax is a vector like [x1, ..., xi, ..., xn], so it doesn't matter if a single value xi is large as long as every xi in X has the same magnitude; then the result of the softmax would not be equal to 1. Am I right? – Richard. Zhu Feb 28 '19 at 06:21
  • Yes, but minor deviations will easily dominate the softmax if you blow the values up. E.g. consider logits of 0.3 and 0.4; then the softmax will not be one. But if you multiply both numbers by 100, to 30 and 40, then the softmax will be 1, even though the relative difference is the same. – Thomas Pinetz Feb 28 '19 at 06:26
  • I don't think this should be explained with a one-hot example; in the paper, the softmax output is the attention weights, and the follow-up operation is multiplication by a value matrix, not cross-entropy with a one-hot vector. – hihell Jul 13 '20 at 17:31
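
A quick NumPy check of the numbers in Thomas Pinetz's comment above: scaling both logits by 100 keeps their ratio the same but pushes the softmax to an essentially one-hot output.

    import numpy as np

    def softmax(x):
        e = np.exp(x - x.max())
        return e / e.sum()

    print(softmax(np.array([0.3, 0.4])))    # [0.475 0.525] -- far from saturated
    print(softmax(np.array([30.0, 40.0])))  # [4.5e-05 ~1.0] -- effectively one-hot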