
I'm training a fully connected neural network to classify the MNIST dataset. The index of the most saturated neuron in the output layer defines the network output (a digit from 0 to 9).

I would like to use the tanh() activation function (just for learning purposes).

What is the correct way to represent an image label as a vector (for generating the error vector that will be backpropagated)?

For the sigmoid() activation, this vector could be a vector of zeros with a single 1 at the position of the classified digit. Does that mean that for tanh() it should be a vector of -1s instead of 0s (based on the range of the function)? What is the general guidance?

Ribtoks
  • Did you use [softmax](https://en.wikipedia.org/wiki/Softmax_function)? – hkchengrex Aug 02 '18 at 06:23
  • @hkchengrex No, I haven't. I have 3 layers (28x28 for the input pixels, 30 neurons in the hidden layer, and 10 neurons in the output layer, one per digit). Each layer uses the same activation function. – Ribtoks Aug 02 '18 at 06:44
  • You should use one since it can represent competition between outputs. If you use plain sigmoid, what should you do when two neurons output '1'? Softmax is kind of like sigmoid but for all channels. It gives the confidence level for each label. I don't think you would still have this question if you use softmax. – hkchengrex Aug 02 '18 at 06:48
  • @hkchengrex I have no problem using `tanh()`. The probability that I get two exactly equal activation outputs in `double` is zero, so there will always be a single "most saturated neuron". My question is about something different: how to represent the label as a vector for error calculation. Can you please read the last paragraph of the question again? Thanks – Ribtoks Aug 02 '18 at 07:11

1 Answer


If you have to use tanh in this case, then yes, you would have to make the image labels either -1 or 1. During training, the pre-activation of the 'correct' digit's neuron is pushed toward positive infinity (so its tanh output approaches 1) and the 'wrong' digits are pushed toward negative infinity (outputs approaching -1).
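A minimal sketch of such a target vector for a tanh output layer (using NumPy; the function name is illustrative):

```python
import numpy as np

def tanh_target(digit, num_classes=10):
    """Target vector for a tanh output layer:
    -1 everywhere except +1 at the true digit's index."""
    target = np.full(num_classes, -1.0)
    target[digit] = 1.0
    return target

# e.g. for the digit 3:
# tanh_target(3) -> [-1, -1, -1, 1, -1, -1, -1, -1, -1, -1]
```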

In general, I would suggest using a softmax instead. Their relationship is well explained here (tanh(x) is just 2*sigmoid(2x) - 1). While a sigmoid answers a binary classification question (is this a '7' or not?), softmax performs multi-class classification (which digit is this most likely to be?). The difference is that softmax represents a probability distribution across all outputs (if I am very confident that this is a '1', there are correspondingly lower probabilities that it is a '3' or '4' or anything else), while multiple independent sigmoids do not enforce that.

In this case, since your target is a one-hot vector, the value for each digit is definitely correlated with the others (i.e. a high response for '1' should suppress the other responses). Using softmax will make training more stable and give better results.
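A minimal sketch of a softmax output (NumPy; the logits here are made-up raw scores for the 10 digit neurons):

```python
import numpy as np

def softmax(z):
    """Convert raw output-layer scores into a probability distribution."""
    e = np.exp(z - z.max())  # subtract the max for numerical stability
    return e / e.sum()

# Hypothetical raw scores for the 10 output neurons:
logits = np.array([2.0, 0.5, -1.0, 0.0, 1.5, -0.5, 0.3, -2.0, 0.8, 0.1])
probs = softmax(logits)
# probs sums to 1 and the largest score gets the highest probability,
# so high confidence in one digit necessarily lowers the others.
```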

hkchengrex