
I am building a deep neural network, and I find that my network converges faster when there is no activation function (softmax) on the fully connected layer. But when I add this softmax function, convergence is really poor and even stalls at a really high loss. By the way, I use the cross-entropy loss as my loss function and RMSprop as my optimizer.

[Image: neural network without the last softmax activation]
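To make the setup concrete, here is a minimal sketch of the two variants (assuming PyTorch, since the answers below refer to CrossEntropyLoss; the layer sizes are made up):

```python
import torch
import torch.nn as nn

# Hypothetical dimensions; the real architecture is not shown in the question.
features, n_classes = 128, 10

# Variant 1: plain fully connected output layer (converges quickly).
net_plain = nn.Sequential(nn.Linear(features, n_classes))

# Variant 2: the same layer followed by softmax (converges badly, stalls at a high loss).
net_softmax = nn.Sequential(nn.Linear(features, n_classes), nn.Softmax(dim=1))

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.RMSprop(net_softmax.parameters(), lr=1e-3)
```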

2 Answers


CrossEntropyLoss expects logits as its input, so probabilities will not do: the most obvious problem is that they massively reduce the dynamic range from (-inf, +inf) to [0, 1].
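A quick sketch of that compression (assuming the PyTorch CrossEntropyLoss this answer refers to): with probabilities as input, the loss is floored at log(1 + (C-1)/e) and cannot approach zero, which matches training that stalls at a high loss.

```python
import torch
import torch.nn.functional as F

# A confident two-class prediction, once as logits and once as softmax probabilities.
logits = torch.tensor([[10.0, -10.0]])      # unbounded range
probs = F.softmax(logits, dim=1)            # ~[1.0, 2e-9], squeezed into [0, 1]
target = torch.tensor([0])

loss_from_logits = F.cross_entropy(logits, target)  # ~2e-9, can approach 0
loss_from_probs = F.cross_entropy(probs, target)    # ~0.313, cannot go any lower

print(loss_from_logits.item(), loss_from_probs.item())
```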

If you want the outputs of the network to be already normalized, I'd strongly recommend LogSoftmax() as the activation combined with NLLLoss as the criterion.

Training with plain softmax is risky for numerical reasons; the output can be easily postprocessed if probabilities are needed in the downstream task.
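A small sketch of the two equivalent setups (PyTorch; shapes are illustrative):

```python
import torch
import torch.nn as nn

# Illustrative shapes: a batch of 4 samples and 10 classes.
logits = torch.randn(4, 10)
target = torch.randint(0, 10, (4,))

# Option A: keep the network un-normalized and feed raw logits to CrossEntropyLoss.
loss_a = nn.CrossEntropyLoss()(logits, target)

# Option B: end the network with LogSoftmax and use NLLLoss as the criterion.
log_probs = nn.LogSoftmax(dim=1)(logits)
loss_b = nn.NLLLoss()(log_probs, target)

print(loss_a.item(), loss_b.item())  # identical up to floating-point error

# If probabilities are needed downstream, exponentiate the log-probabilities.
probs = log_probs.exp()
```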

dedObed

If you use cross-entropy as the loss function for your model, you have to make sure your final outputs are valid probabilities, i.e. they are non-negative, lie in (0, 1), and sum up to one.

This is ensured by the softmax activation, which rescales the outputs of your last layer into probabilities. If you omit this step, the loss you are calculating is neither correct nor representative of the convergence/learning of your model.
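For illustration, a small sketch of what the softmax step does to the raw outputs of the last layer (assuming PyTorch, as in the other answer):

```python
import torch
import torch.nn.functional as F

raw_outputs = torch.tensor([[2.0, -1.0, 0.5]])  # arbitrary last-layer outputs
probs = F.softmax(raw_outputs, dim=1)

print(probs)        # approx. [[0.786, 0.039, 0.175]]: non-negative, each in (0, 1)
print(probs.sum())  # 1.0: the values sum to one
```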

Tinu