I misread PyTorch's NLLLoss()
and accidentally passed my model's probabilities to the loss function instead of my model's log probabilities, which is what the function expects. However, when I train a model under this misused loss function, the model (a) learns faster, (b) learns more stably, (c) reaches a lower loss, and (d) performs better at the classification task.
I don't have a minimal working example, but I'm curious if anyone else has experienced this or knows why it happens. Any hypotheses?
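For context, the pattern of the mistake is roughly this (a sketch of the setup, not my actual training code; the shapes and class count are made up):

```python
import torch
import torch.nn.functional as F

logits = torch.randn(4, 10)              # batch of 4, 10 classes (arbitrary)
targets = torch.randint(0, 10, (4,))

# Correct usage: NLLLoss expects log probabilities
log_probs = F.log_softmax(logits, dim=1)
correct_loss = F.nll_loss(log_probs, targets)  # = -mean(log p_target)

# My mistake: passing probabilities instead
probs = F.softmax(logits, dim=1)
misused_loss = F.nll_loss(probs, targets)      # = -mean(p_target)
```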
One hypothesis I have is that the gradient of the misused loss is more stable. With probabilities as input, NLLLoss reduces to -p_target, whose derivative with respect to the output probability is a constant -1, whereas the correct -log(p_target) has derivative -1/p_target, which blows up when the model assigns a small probability to the true class.
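A toy check of that derivative claim (just autograd on a few hand-picked probability values, not from my model):

```python
import torch

p = torch.tensor([0.01, 0.5, 0.99], requires_grad=True)

# Correct loss: -log(p); gradient w.r.t. p is -1/p
(-p.log()).sum().backward()
print(p.grad)   # tensor([-100.0000, -2.0000, -1.0101]) -- explodes as p -> 0

p.grad = None

# Misused loss: -p; gradient w.r.t. p is a constant -1
(-p).sum().backward()
print(p.grad)   # tensor([-1., -1., -1.])
```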