I am training a neural network to distinguish between three classes. Naturally, I went for PyTorch's CrossEntropyLoss. During experimentation, I realized that the loss was significantly higher when a Softmax layer was put at the end of the model. So I decided to experiment further:
import torch
from torch import nn

# Three predictions with the same ordering of classes; each is a
# constant multiple of the previous one (x10, then x2)
pred_1 = torch.tensor([[0.1, 0.2, 0.7]])
pred_2 = torch.tensor([[1.0, 2.0, 7.0]])
pred_3 = torch.tensor([[2.0, 4.0, 14.0]])
true = torch.tensor([2])

loss = nn.CrossEntropyLoss()
print(loss(pred_1, true))
print(loss(pred_2, true))
print(loss(pred_3, true))
The result of this code is as follows:
0.7679
0.0092
5.1497e-05
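These numbers are consistent with CrossEntropyLoss applying log_softmax internally and then taking the negative log-probability of the target class. A quick self-contained check of that equivalence, reusing the first prediction from above:

```python
import torch
from torch import nn
import torch.nn.functional as F

pred = torch.tensor([[0.1, 0.2, 0.7]])
true = torch.tensor([2])

# The built-in loss...
ce = nn.CrossEntropyLoss()(pred, true)

# ...matches log_softmax followed by picking out the target class's
# negative log-probability.
manual = -F.log_softmax(pred, dim=1)[0, true.item()]

print(ce.item(), manual.item())  # both ≈ 0.7679
```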
In other words, I also tried what happens when the inputs are multiplied by a constant.
Several sources (1, 2) stated that the loss has a softmax built in, but if that were the case, I would have expected all of the examples above to return the same loss, which clearly isn't the case.
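One detail worth noting here: having a softmax built in does not make the loss scale-invariant, because softmax itself is not scale-invariant. Multiplying the logits by a constant pushes the probability mass onto the largest entry, which may explain the differing losses. A minimal check:

```python
import torch
import torch.nn.functional as F

x = torch.tensor([[0.1, 0.2, 0.7]])

# softmax(x) and softmax(10 * x) are different distributions:
# scaling the logits sharpens the result toward the largest entry.
p1 = F.softmax(x, dim=1)
p2 = F.softmax(10 * x, dim=1)

print(p1)  # ≈ [0.2546, 0.2814, 0.4640]
print(p2)  # ≈ [0.0025, 0.0067, 0.9909]
```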
This poses the following question: if bigger outputs lead to a lower loss, wouldn't the network optimize towards outputting bigger values, thereby skewing the loss curves? If so, it seems like a Softmax layer would fix that. But since this results in a higher loss overall, how useful would the resulting loss actually be?
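For reference, here is a minimal sketch of the double application described above, assuming the model's Softmax output is passed straight to CrossEntropyLoss: the loss then applies softmax a second time, and because probabilities are confined to [0, 1], the loss has a floor above zero (about 0.5514 for three classes):

```python
import torch
from torch import nn

logits = torch.tensor([[2.0, 4.0, 14.0]])
true = torch.tensor([2])
loss = nn.CrossEntropyLoss()

# Raw logits: the loss can get arbitrarily close to zero.
raw = loss(logits, true)

# Softmax outputs fed into the loss: softmax gets applied twice.
# Since the inputs now lie in [0, 1], the loss is bounded below by
# -log(e / (e + 2)) ≈ 0.5514, no matter how confident the model is.
probs = nn.Softmax(dim=1)(logits)
double = loss(probs, true)

print(raw.item(), double.item())
```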