At forward time we have (with x = input vector, y = output vector, f = logsoftmax, i = i-th component):
yi = f(xi)
= log( exp(xi) / sum_j(exp(xj)) )
= xi - log( sum_j(exp(xj)) )
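As an aside, here is a minimal NumPy sketch of this forward pass (my own illustration, not the Torch code; the max subtraction is only for numerical stability and does not change the result):

    import numpy as np

    def log_softmax(x):
        # yi = xi - log( sum_j(exp(xj)) ); shifting by max(x) avoids overflow in exp()
        shifted = x - np.max(x)
        return shifted - np.log(np.sum(np.exp(shifted)))

    x = np.array([1.0, 2.0, 3.0])
    y = log_softmax(x)
    print(y)                  # log-probabilities
    print(np.sum(np.exp(y)))  # ~1.0: exp(y) is a softmax distribution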
When computing the Jacobian Jf of f, we have for the i-th row:
dyi/dxi = 1 - exp(xi) / sum_j(exp(xj))
And for k different from i:
dyi/dxk = - exp(xk) / sum_j(exp(xj))
This gives for Jf:
1-E(x1) -E(x2) -E(x3) ...
-E(x1) 1-E(x2) -E(x3) ...
-E(x1) -E(x2) 1-E(x3) ...
...
With E(xi) = exp(xi) / sum_j(exp(xj))
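For what it's worth, a small NumPy sketch (names are mine) that builds this Jacobian explicitly and checks it against finite differences:

    import numpy as np

    def log_softmax(x):
        shifted = x - np.max(x)
        return shifted - np.log(np.sum(np.exp(shifted)))

    def log_softmax_jacobian(x):
        # Jf[i, k] = dyi/dxk = (1 if i == k else 0) - E(xk), with E(x) = softmax(x)
        E = np.exp(x) / np.sum(np.exp(x))
        return np.eye(len(x)) - np.tile(E, (len(x), 1))

    x = np.array([0.5, -1.0, 2.0])
    Jf = log_softmax_jacobian(x)

    # finite-difference check of the matrix above
    eps = 1e-6
    Jf_num = np.empty_like(Jf)
    for k in range(len(x)):
        dx = np.zeros_like(x)
        dx[k] = eps
        Jf_num[:, k] = (log_softmax(x + dx) - log_softmax(x - dx)) / (2 * eps)
    print(np.max(np.abs(Jf - Jf_num)))  # ~1e-10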
If we name gradInput the gradient w.r.t. the input and gradOutput the gradient w.r.t. the output, backpropagation gives (chain rule):
gradInputi = sum_j( gradOutputj . dyj/dxi )
This is equivalent to:
gradInput = transpose(Jf) . gradOutput
Which finally gives for the i-th component:
gradInputi = gradOutputi - E(xi) . sum_j( gradOutputj )
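A quick numerical check that this closed form agrees with the full transpose(Jf) . gradOutput product (again just a sketch with made-up values):

    import numpy as np

    x = np.array([0.5, -1.0, 2.0])
    gradOutput = np.array([0.1, -0.3, 0.7])

    E = np.exp(x) / np.sum(np.exp(x))              # E(xi) = softmax of the input
    Jf = np.eye(len(x)) - np.tile(E, (len(x), 1))  # Jacobian as above

    gradInput_full = Jf.T @ gradOutput             # transpose(Jf) . gradOutput
    gradInput_closed = gradOutput - E * np.sum(gradOutput)

    print(np.allclose(gradInput_full, gradInput_closed))  # True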
So the first loop pre-computes sum_j( gradOutputj ), and the second loop computes the above term, i.e. the i-th component of the input gradient - except that the exponential term in the Torch implementation appears to be missing a 1 / sum_j(exp(xj)) factor (the derivation above should probably be double-checked, even though it sounds correct and explains the current implementation).
UPDATE: there is no problem with the missing 1 / sum_j(exp(xj)) term. Since the Jacobian is evaluated on the output value, and since this previously computed output is precisely a log-softmax distribution, it follows that the sum of exponentials of this distribution is 1:
sum_j(exp(outputj)) = sum_j( exp( log( exp(inputj) / sum_k(exp(inputk)) ) ) )
                    = sum_j( exp(inputj) / sum_k(exp(inputk)) )
                    = 1
So there is no need to write this term explicitly in the implementation, which gives (for x = output):
gradInputi = gradOutputi - exp(outputi) . sum_j( gradOutputj )
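Putting it together, here is a sketch of the backward written only in terms of the saved output, mirroring the two loops mentioned above (pre-compute the sum, then apply the formula); this is my own illustration in NumPy, not the actual Torch C code:

    import numpy as np

    def log_softmax(x):
        shifted = x - np.max(x)
        return shifted - np.log(np.sum(np.exp(shifted)))

    def log_softmax_backward(output, gradOutput):
        # first loop: pre-compute sum_j( gradOutputj )
        s = np.sum(gradOutput)
        # second loop: gradInputi = gradOutputi - exp(outputi) * s
        # exp(outputi) already equals E(xi), so no extra 1 / sum_j(exp(xj)) factor is needed
        return gradOutput - np.exp(output) * s

    x = np.array([0.5, -1.0, 2.0])
    gradOutput = np.array([0.1, -0.3, 0.7])
    output = log_softmax(x)

    E = np.exp(x) / np.sum(np.exp(x))
    print(np.allclose(log_softmax_backward(output, gradOutput),
                      gradOutput - E * np.sum(gradOutput)))  # True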