
I misread PyTorch's NLLLoss() and accidentally passed my model's probabilities to the loss function instead of my model's log probabilities, which is what the function expects. However, when I train a model under this misused loss function, the model (a) learns faster, (b) learns more stably, (c) reaches a lower loss, and (d) performs better at the classification task. A rough sketch of what I mean is below.
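(Not my actual code, just a minimal illustration of the mix-up, with made-up logits and targets:)

```python
import torch
import torch.nn as nn

logits = torch.randn(4, 3)               # hypothetical batch: 4 examples, 3 classes
targets = torch.tensor([0, 2, 1, 0])

criterion = nn.NLLLoss()

# Intended usage: NLLLoss expects log-probabilities
correct_loss = criterion(torch.log_softmax(logits, dim=1), targets)

# The misuse described above: passing probabilities instead
wrong_loss = criterion(torch.softmax(logits, dim=1), targets)
```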

I don't have a minimal working example, but I'm curious whether anyone else has experienced this or knows why it happens. Any possible hypotheses?

One hypothesis I have is that the gradient of the misused loss is more stable because it isn't scaled by 1/(model's output probability), as the toy example below suggests.
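To illustrate what I mean (toy values, not my model): for the true-class probability p, the correct loss -log(p) has gradient -1/p, which blows up as p gets small, while the misused loss -p has a constant gradient of -1.

```python
import torch

p = torch.tensor([0.01], requires_grad=True)  # hypothetical small probability of the true class

# Correct NLL on log-probabilities: -log(p); gradient is -1/p and explodes as p -> 0
loss_correct = -torch.log(p)
loss_correct.backward()
print(p.grad)   # tensor([-100.])

p.grad = None

# Misused "NLL" on raw probabilities: -p; gradient is a constant -1
loss_wrong = -p
loss_wrong.backward()
print(p.grad)   # tensor([-1.])
```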

Rylan Schaeffer

0 Answers