4

Hidden layers of a classifier network use sigmoid (or another activation function) to introduce non-linearity and squash the values, but should the last layer use sigmoid in conjunction with softmax?

I have a feeling it doesn't matter and the network will train either way, but should a softmax layer alone be used, or should the sigmoid function be applied first?
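For concreteness, here is roughly what I'm comparing (a minimal Keras-style sketch; the framework, layer sizes, and class count are arbitrary choices just for illustration):

```python
from tensorflow import keras

# Option A: logits feed straight into a softmax output layer.
option_a = keras.Sequential([
    keras.Input(shape=(20,)),
    keras.layers.Dense(64, activation="sigmoid"),  # hidden layer
    keras.layers.Dense(10, activation="softmax"),  # output layer
])

# Option B: sigmoid applied to the last layer first, then softmax on top.
option_b = keras.Sequential([
    keras.Input(shape=(20,)),
    keras.layers.Dense(64, activation="sigmoid"),  # hidden layer
    keras.layers.Dense(10, activation="sigmoid"),  # sigmoid output
    keras.layers.Softmax(),                        # softmax applied afterwards
])
```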

Evan Weissburg

1 Answer

3

In general, there's no point in an additional sigmoid activation just before the softmax output layer. Since the sigmoid function is a special case of softmax, it would just squash the values into the [0, 1] interval twice in a row, which gives a nearly uniform output distribution. Of course, you can still backpropagate through this, but training will be much less efficient.
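Here's a quick NumPy illustration of that double squashing (the logits are made up, purely for the demonstration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())  # subtract max for numerical stability
    return e / e.sum()

logits = np.array([2.0, -1.0, 0.5, 4.0])

print(softmax(logits))
# ~[0.12, 0.006, 0.03, 0.85] -- clearly peaked on the last class

print(softmax(sigmoid(logits)))
# ~[0.29, 0.16, 0.23, 0.32] -- sigmoid first squashes everything into [0, 1],
# so the final distribution is nearly uniform and the gradient signal is dampened
```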

By the way, if you choose not to use ReLU, tanh is by all means a better activation function than sigmoid.
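One concrete difference you can check directly is that tanh is zero-centered while sigmoid is not:

```python
import numpy as np

z = np.linspace(-3.0, 3.0, 7)
print(np.tanh(z))            # symmetric around 0, range (-1, 1)
print(1 / (1 + np.exp(-z)))  # sigmoid: always positive, range (0, 1)
```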

Maxim
  • Thanks! Can you direct me to a resource where I can read further about tanh vs sigmoid in classifiers? I've seen them described as extremely comparable before. – Evan Weissburg Oct 07 '17 at 21:00
  • @EvanWeissburg Sure, highly recommend this post - http://cs231n.github.io/neural-networks-1/#actfun – Maxim Oct 07 '17 at 21:04