4

Hidden layers of a classifier network use sigmoid (or another activation function) to introduce non-linearity and squash the values, but should the last layer use sigmoid in conjunction with softmax?

I have a feeling it doesn't matter and the network will train either way, but should a softmax layer alone be used, or should the sigmoid function be applied first?
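For concreteness, here is roughly what I'm comparing (a minimal Keras-style sketch; the framework, layer sizes, and class count are arbitrary choices just for illustration):

```python
from tensorflow import keras

# Option A: logits feed straight into a softmax output layer.
option_a = keras.Sequential([
    keras.Input(shape=(20,)),
    keras.layers.Dense(64, activation="sigmoid"),  # hidden layer
    keras.layers.Dense(10, activation="softmax"),  # output layer
])

# Option B: sigmoid applied to the last layer first, then softmax on top.
option_b = keras.Sequential([
    keras.Input(shape=(20,)),
    keras.layers.Dense(64, activation="sigmoid"),  # hidden layer
    keras.layers.Dense(10, activation="sigmoid"),  # sigmoid output
    keras.layers.Softmax(),                        # softmax applied afterwards
])
```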

Evan Weissburg

1 Answer

3

In general, there's no point in an additional sigmoid activation just before the softmax output layer. Since the sigmoid function is a special case of softmax, it would just squash the values into the [0, 1] interval twice in a row, which gives a nearly uniform output distribution. Of course, you can still backpropagate through this, but training will be much less efficient.
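Here's a quick NumPy illustration of that double squashing (the logits are made up, purely for the demonstration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())  # subtract max for numerical stability
    return e / e.sum()

logits = np.array([2.0, -1.0, 0.5, 4.0])

print(softmax(logits))
# ~[0.12, 0.006, 0.03, 0.85] -- clearly peaked on the last class

print(softmax(sigmoid(logits)))
# ~[0.29, 0.16, 0.23, 0.32] -- sigmoid first squashes everything into [0, 1],
# so the final distribution is nearly uniform and the gradient signal is dampened
```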

By the way, if you choose not to use ReLU, tanh is by all means a better activation function than sigmoid.
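One concrete difference you can check directly is that tanh is zero-centered while sigmoid is not:

```python
import numpy as np

z = np.linspace(-3.0, 3.0, 7)
print(np.tanh(z))            # symmetric around 0, range (-1, 1)
print(1 / (1 + np.exp(-z)))  # sigmoid: always positive, range (0, 1)
```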

Maxim
  • Thanks! Can you direct me to a resource where I can read further about tanh vs sigmoid in classifiers? I've seen them described as extremely comparable before. – Evan Weissburg Oct 07 '17 at 21:00
  • @EvanWeissburg Sure, highly recommend this post - http://cs231n.github.io/neural-networks-1/#actfun – Maxim Oct 07 '17 at 21:04