I am trying to implement a network which has the following loss function definition in PyTorch:
logits = F.log_softmax(layer_output, dim=1)
loss = F.nll_loss(logits, labels)
This link https://discuss.pytorch.org/t/pytorch-equivalence-to-sparse-softmax-cross-entropy-with-logits-in-tensorflow/18727 mentions that log_softmax should be used instead of softmax before computing the NLL loss, because it is numerically more stable.
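If it helps, here is a minimal self-contained version of that pairing (the shapes and tensor values below are just placeholders, not my real data); as far as I can tell, F.cross_entropy fuses the same two steps:

import torch
import torch.nn.functional as F

# placeholder batch: 4 samples, 3 classes
layer_output = torch.randn(4, 3)
labels = torch.tensor([0, 2, 1, 0])

log_probs = F.log_softmax(layer_output, dim=1)  # log-probabilities
loss = F.nll_loss(log_probs, labels)

# F.cross_entropy combines log_softmax and nll_loss internally
assert torch.allclose(loss, F.cross_entropy(layer_output, labels))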
In TensorFlow I have the following code:
logits = tf.nn.log_softmax(layer_output)
loss = tf.losses.log_loss(logits, labels)
This gives a NaN loss from the first iteration. If I use tf.nn.softmax instead, I don't get NaN values. But the link says the log_softmax route should be the more stable one. Is there a specific reason for this? I could get rid of the NaNs using tf.clip_by_value, but that leads to vanishing gradients.
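For completeness, this is roughly how I applied the clipping workaround; the input tensors, clip bounds, and exact placement are placeholders here (my real network and data are larger), so treat it as a sketch rather than my actual code:

import numpy as np
import tensorflow as tf

# placeholder stand-ins for my real network output and one-hot labels
layer_output = tf.constant(np.random.randn(4, 3), dtype=tf.float32)
labels = tf.constant([[1., 0., 0.],
                      [0., 0., 1.],
                      [0., 1., 0.],
                      [1., 0., 0.]])

log_probs = tf.nn.log_softmax(layer_output)
# clip to a finite range before the loss; these bounds are made up
clipped = tf.clip_by_value(log_probs, -100.0, 0.0)
loss = tf.losses.log_loss(clipped, labels)  # same call order as in my code above

with tf.Session() as sess:
    print(sess.run(loss))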