
I am training a neural network to distinguish between three classes. Naturally, I went for PyTorch's CrossEntropyLoss. During experimentation, I realized that the loss was significantly higher when a Softmax layer was put at the end of the model. So I decided to experiment further:

import torch
from torch import nn

pred_1 = torch.Tensor([[0.1, 0.2, 0.7]])
pred_2 = torch.Tensor([[1, 2, 7]])    # pred_1 scaled by 10
pred_3 = torch.Tensor([[2, 4, 14]])   # pred_1 scaled by 20
true = torch.Tensor([2]).long()       # ground-truth class index

loss = nn.CrossEntropyLoss()

print(loss(pred_1, true))
print(loss(pred_2, true))
print(loss(pred_3, true))

The result of this code is as follows:

0.7679
0.0092
5.1497e-05

I also tried what happens when multiplying the input by various constants (plot: losses for different multipliers).

Several sources (1, 2) stated that the loss has a softmax built in, but if that were the case, I would have expected all of the examples above to return the same loss, which clearly isn't what happens.
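Indeed, a minimal check (using torch.nn.functional.log_softmax and nn.NLLLoss) reproduces the second value above, so the softmax does seem to be applied internally, yet the losses still differ:

import torch
from torch import nn
import torch.nn.functional as F

pred = torch.Tensor([[1, 2, 7]])
true = torch.Tensor([2]).long()

# Built-in cross-entropy on raw scores
ce = nn.CrossEntropyLoss()(pred, true)

# Equivalent: log-softmax followed by negative log-likelihood
nll = nn.NLLLoss()(F.log_softmax(pred, dim=1), true)

print(ce)   # tensor(0.0092)
print(nll)  # tensor(0.0092)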

This poses the following question: if bigger outputs lead to a lower loss, wouldn't the network optimize towards outputting bigger values, thereby skewing the loss curves? If so, it seems like a Softmax layer would fix that. But since this results in a higher loss overall, how useful would the resulting loss actually be?

Matze
  • I’m voting to close this question because it is not about programming as defined in the [help] but about ML theory and/or methodology - please see the intro and NOTE in https://stackoverflow.com/tags/machine-learning/info – desertnaut Dec 30 '21 at 14:49

1 Answer


From the docs, the input to CrossEntropyLoss "is expected to contain raw, unnormalized scores for each class". Those are typically called logits.

There are two questions:

  • Scaling the logits should not yield the same cross-entropy. You might be thinking of a linear normalization, but the (implicit) softmax in the cross-entropy normalizes the exponential of the logits (see the sketch after this list).
  • This causes the learning to optimize toward larger values of the logits. This is exactly what you want because it means that the network is more "confident" of the classification prediction. (The posterior p(c|x) is closer to the ground truth.)
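As a quick numerical illustration (a minimal sketch with torch.nn.functional.softmax), multiplying the logits changes the softmax output, whereas adding a constant to every logit does not:

import torch
import torch.nn.functional as F

logits = torch.Tensor([[0.1, 0.2, 0.7]])

# Multiplying the logits changes the softmax output (and hence the loss) ...
print(F.softmax(logits, dim=1))       # tensor([[0.2546, 0.2814, 0.4640]])
print(F.softmax(logits * 10, dim=1))  # tensor([[0.0025, 0.0067, 0.9909]])

# ... whereas adding a constant to every logit does not.
print(F.softmax(logits + 5, dim=1))   # tensor([[0.2546, 0.2814, 0.4640]])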
ATony
  • Thank you, that makes a lot of sense. However, could it in theory be possible for the network to just keep producing larger and larger outputs, thereby reducing the loss? I reckon that in practice this will never be the case due to a variety of reasons. But it seems a bit odd to me that simply raising all output nodes results in a lower loss. – Matze Dec 30 '21 at 15:18
  • 1
  • It is totally possible, and not just in theory... This is usually not a problem because (1) the loss saturates and thus the "incentive" (i.e., gradient) converges to zero, and (2) one should stop training early (a.k.a. early stopping) based on the validation loss. Moreover, one typically adds regularization on the layer weights, which translates into a loss term pulling in the opposite direction (and which stabilizes it thanks to (1)). – ATony Dec 30 '21 at 15:50
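As a rough sketch of the regularization point, using a hypothetical 3-class linear model (not anything from the question), weight decay in the optimizer adds an L2 penalty that pulls against ever-growing weights and hence logits:

import torch
from torch import nn

# Hypothetical 3-class linear classifier; weight_decay adds an L2 penalty
# on the weights, counteracting the incentive to inflate the logits.
model = nn.Linear(10, 3)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=1e-4)
loss_fn = nn.CrossEntropyLoss()

x = torch.randn(4, 10)         # dummy batch of 4 samples
y = torch.randint(0, 3, (4,))  # dummy target classes

optimizer.zero_grad()
loss_fn(model(x), y).backward()
optimizer.step()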