
I am building a deep neural network, and I find that my network converges faster when there is no activation function (softmax) on the fully connected layer. But when I add this softmax function, convergence is really poor and even stalls at a really high loss. By the way, I use the cross-entropy loss as my loss function and RMSprop as my optimizer.

[Image: neural network without the last softmax activation]
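To make the setup concrete, here is a minimal sketch of the two variants (assuming PyTorch, since the answers below refer to CrossEntropyLoss; the layer sizes are made up):

```python
import torch
import torch.nn as nn

# Hypothetical dimensions; the real architecture is not shown in the question.
features, n_classes = 128, 10

# Variant 1: plain fully connected output layer (converges quickly).
net_plain = nn.Sequential(nn.Linear(features, n_classes))

# Variant 2: the same layer followed by softmax (converges badly, stalls at a high loss).
net_softmax = nn.Sequential(nn.Linear(features, n_classes), nn.Softmax(dim=1))

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.RMSprop(net_softmax.parameters(), lr=1e-3)
```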

2 Answers


CrossEntropyLoss expects logits as its input, so probabilities will not do: the most obvious problem is that they massively reduce the dynamic range from (-inf, +inf) to [0, 1].
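A quick sketch of that compression (assuming the PyTorch CrossEntropyLoss this answer refers to): with probabilities as input, the loss is floored at log(1 + (C-1)/e) and cannot approach zero, which matches training that stalls at a high loss.

```python
import torch
import torch.nn.functional as F

# A confident two-class prediction, once as logits and once as softmax probabilities.
logits = torch.tensor([[10.0, -10.0]])      # unbounded range
probs = F.softmax(logits, dim=1)            # ~[1.0, 2e-9], squeezed into [0, 1]
target = torch.tensor([0])

loss_from_logits = F.cross_entropy(logits, target)  # ~2e-9, can approach 0
loss_from_probs = F.cross_entropy(probs, target)    # ~0.313, cannot go any lower

print(loss_from_logits.item(), loss_from_probs.item())
```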

If you want the outputs of the network to be already normalized, I'd strongly recommend LogSoftmax() as the activation combined with NLLLoss as the criterion.

Training with plain softmax is risky for numerical reasons; the output can be easily postprocessed if probabilities are needed in the downstream task.
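A small sketch of the two equivalent setups (PyTorch; shapes are illustrative):

```python
import torch
import torch.nn as nn

# Illustrative shapes: a batch of 4 samples and 10 classes.
logits = torch.randn(4, 10)
target = torch.randint(0, 10, (4,))

# Option A: keep the network un-normalized and feed raw logits to CrossEntropyLoss.
loss_a = nn.CrossEntropyLoss()(logits, target)

# Option B: end the network with LogSoftmax and use NLLLoss as the criterion.
log_probs = nn.LogSoftmax(dim=1)(logits)
loss_b = nn.NLLLoss()(log_probs, target)

print(loss_a.item(), loss_b.item())  # identical up to floating-point error

# If probabilities are needed downstream, exponentiate the log-probabilities.
probs = log_probs.exp()
```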

dedObed

If you use cross-entropy as the loss function for your model, you have to make sure your final outputs are valid probabilities, i.e. they are non-negative, lie in (0, 1), and sum up to one.

This is ensured by the softmax activation, which rescales the outputs of your last layer into probabilities. If you omit this step, the loss you are calculating is neither correct nor representative of the convergence/learning of your model.
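For illustration, a small sketch of what the softmax step does to the raw outputs of the last layer (assuming PyTorch, as in the other answer):

```python
import torch
import torch.nn.functional as F

raw_outputs = torch.tensor([[2.0, -1.0, 0.5]])  # arbitrary last-layer outputs
probs = F.softmax(raw_outputs, dim=1)

print(probs)        # approx. [[0.786, 0.039, 0.175]]: non-negative, each in (0, 1)
print(probs.sum())  # 1.0: the values sum to one
```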

Tinu