Rather than training a neural network to output 1 or 0 through its sigmoid output layer, LeCun recommends (in the paper "Efficient BackProp", LeCun et al., 1998, section 4.5):
> Choose target values at the point of the maximum second derivative on the sigmoid so as to avoid saturating the output units.
And here (https://machinelearningmastery.com/best-advice-for-configuring-backpropagation-for-deep-learning-neural-networks/), the values of 0.9 and 0.1 are recommended.
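For what it's worth, I tried to work out what that recommendation would mean for the plain logistic sigmoid (LeCun's paper actually uses a scaled tanh, so this is only my own analogy, not something taken from the paper):

$$
\sigma(x)=\frac{1}{1+e^{-x}},\qquad
\sigma''(x)=\sigma(x)\bigl(1-\sigma(x)\bigr)\bigl(1-2\sigma(x)\bigr)
$$
$$
\frac{d}{dx}\,\sigma''(x)=0
\;\Longleftrightarrow\;
6\sigma^2-6\sigma+1=0
\;\Longrightarrow\;
\sigma=\tfrac12\pm\tfrac{\sqrt3}{6}\approx 0.21,\ 0.79
$$

So I read the 0.9 / 0.1 values below as a round-number version of the same idea: keep the targets away from the asymptotes at 0 and 1.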
This raises two questions:
- Is this possible using Keras? It seems like both of Keras's cross-entropy losses (`BinaryCrossentropy` and `CategoricalCrossentropy`) expect target values of 1 or 0. (See the sketch after this list for what I have in mind.)
- Assuming I have more than two classes, the sum of the target values would be greater than 1 (i.e., not a probability distribution). Is that a problem? I assume not, as long as you know how to interpret the values. To clarify, I'd rather not use softmax in such a case and instead stick with sigmoids. With sigmoids, I think it's easier to interpret each class's output as a confidence metric, unlike with softmax, where the exponentiation makes that more difficult. The fact that the outputs over all classes don't sum to 1 shouldn't prevent me from interpreting each individual class's output as a probability / confidence metric, IIUC (I realize this isn't strictly true in a mathematical sense, but intuitively it makes sense to me; if I'm wrong, please point out my error).
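Here is a minimal sketch of what I have in mind, assuming `tf.keras.losses.BinaryCrossentropy` will in fact accept non-binary float targets (which is part of what I'm asking); the layer sizes and data are made up purely for illustration:

```python
import numpy as np
import tensorflow as tf

num_classes = 3

# Independent sigmoid outputs, one per class (no softmax).
model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="tanh", input_shape=(8,)),
    tf.keras.layers.Dense(num_classes, activation="sigmoid"),
])
model.compile(optimizer="adam", loss=tf.keras.losses.BinaryCrossentropy())

# Toy data: map hard one-hot labels {0, 1} to soft targets {0.1, 0.9}.
x = np.random.rand(32, 8).astype("float32")
hard = tf.one_hot(np.random.randint(0, num_classes, size=32), num_classes)
soft = hard * 0.8 + 0.1  # 1 -> 0.9, 0 -> 0.1

model.fit(x, soft, epochs=1, verbose=0)
```

If I'm reading the docs right, passing `label_smoothing=0.2` to `BinaryCrossentropy` would apply the same 0.9 / 0.1 mapping for me instead of my doing it by hand, but I'm not sure whether that's the intended use.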