
Rather than training a neural network to output 1 or 0 through a sigmoid output layer, LeCun recommends the following (in the paper "Efficient BackProp", LeCun et al., 1998, section 4.5):

Choose target values at the point of the maximum second derivative on the sigmoid so as to avoid saturating the output units.

And here (https://machinelearningmastery.com/best-advice-for-configuring-backpropagation-for-deep-learning-neural-networks/), the values of 0.9 and 0.1 are recommended.
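To make the motivation concrete, here is a quick numerical sketch of my own (not from the paper or the blog post) showing that the magnitude of the logistic sigmoid's second derivative peaks well inside (0, 1) rather than at the asymptotes, which is why targets pulled in from 0 and 1 (such as 0.1 and 0.9) are suggested:

import numpy as np

x = np.linspace(-10, 10, 100001)
s = 1.0 / (1.0 + np.exp(-x))       # logistic sigmoid
d2 = s * (1 - s) * (1 - 2 * s)     # analytic second derivative of the sigmoid

peak_x = x[np.argmax(np.abs(d2))]
print(1.0 / (1.0 + np.exp(-peak_x)))  # sigmoid value at the peak, roughly 0.21 (and 0.79 by symmetry)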

This raises two questions:

  1. Is this possible using keras? It seems like the two cross entropy functions of keras (BinaryCrossentropy and CategoricalCrossentropy) both expect target values of 1 or 0.
  2. Assuming I have more than two classes, the sum of the target values would be greater than 1 (i.e., not a probability distribution). Is that a problem? I assume not, as long as you know how to interpret the values. To clarify, I'd rather not use softmax in such a case and instead stick with sigmoids. With sigmoids I find it easier to interpret each class's output as a confidence metric, unlike with softmax, where the exponentiation and normalization make that harder. The fact that the outputs don't sum to 1 shouldn't prevent me from interpreting each individual class's output as a probability / confidence metric, IIUC (I realize this isn't strictly true in a mathematical sense, but intuitively it makes sense to me; if I'm wrong, please point out my error). A small sketch of what I mean is below.
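As a toy illustration of what I mean (my own sketch, not tied to any particular model):

import numpy as np

logits = np.array([2.0, 0.5, -1.0])             # hypothetical raw outputs for 3 classes

sigmoid = 1.0 / (1.0 + np.exp(-logits))         # independent per-class scores
softmax = np.exp(logits) / np.exp(logits).sum()

print(sigmoid, sigmoid.sum())   # ~[0.88 0.62 0.27], sums to ~1.77 -- not a distribution
print(softmax, softmax.sum())   # sums to exactly 1, but each value depends on all the others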
JMS

2 Answers


It seems like the two cross entropy functions of keras (BinaryCrossentropy and CategoricalCrossentropy) both expect target values of 1 or 0.

The docs are lying; cross-entropy is a measure of the difference between probability distributions -- any two probability distributions p and q. There is really no requirement for either of them to be one-hot. In any case, both BinaryCrossentropy and CategoricalCrossentropy have a label_smoothing argument that you can use for this purpose. A label_smoothing of k will modify your targets like this:

smooth_targets = (1 - k)*hard_targets + k*uniform_targets

So for example in the binary case, a label smoothing of 0.1 will result in targets of (0.05, 0.95) instead of (0, 1).

I don't really understand the second part of the question, but this idea can be generalized to multiple classes. E.g. for 10 classes, you could use 0.91 instead of 1 for the true class, and 0.01 instead of 0 for the other classes. Still sums to 1.
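For example, here is a quick sanity check (a sketch; the values and shapes are just placeholders) that Keras accepts soft targets directly, and that label_smoothing applied to hard labels gives the same result:

import tensorflow as tf

y_pred = tf.constant([[0.2, 0.8], [0.7, 0.3]])

# Soft targets passed directly -- no one-hot requirement
y_true_soft = tf.constant([[0.1, 0.9], [0.9, 0.1]])
cce = tf.keras.losses.CategoricalCrossentropy()
print(cce(y_true_soft, y_pred).numpy())

# The same thing via label_smoothing on hard 0/1 labels:
# with 2 classes, label_smoothing=0.2 turns (0, 1) into (0.1, 0.9)
y_true_hard = tf.constant([[0.0, 1.0], [1.0, 0.0]])
cce_smooth = tf.keras.losses.CategoricalCrossentropy(label_smoothing=0.2)
print(cce_smooth(y_true_hard, y_pred).numpy())  # should match the value above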

xdurch0
  • Thanks, is it required to use `label_smoothing` to modify target values, or can I just set the values I want manually? Regarding the second part of my question, that's not exactly what I meant. I want each class to have a value of 0.1, and to not have to scale it down so that the sum equals 1, because based on the mathematical nature of the sigmoid, using a value of ~0.1 is desirable, and going to e.g. 0.01 would lose the effect that LeCun describes. P.S. I'm surprised the docs are lying! That seems like an unfortunate oversight. – JMS May 26 '23 at 17:38
  • The reason I ask regarding setting it manually is that I am uncertain regarding the labels of some of my classes, so I'd like to assign values between 0.9 and 0.1 to those uncertain classes. So providing a vector that is purely one hot encoded and relying on `label_smoothing` wouldn't work for me (for those "uncertain" classes) – JMS May 26 '23 at 17:41
  • what do you think? – JMS May 27 '23 at 17:50

Answering this part of your question:

The reason I ask regarding setting it manually is that I am uncertain regarding the labels of some of my classes, so I'd like to assign values between 0.9 and 0.1 to those uncertain classes. So providing a vector that is purely one hot encoded and relying on label_smoothing wouldn't work for me (for those "uncertain" classes)

import numpy as np

# Define a (possibly soft) target vector for each class manually
target_values = np.array([
    [1.0, 0.0, 0.0],  # Class 1
    [0.0, 1.0, 0.0],  # Class 2
    [0.2, 0.8, 0.0],  # Class 3 (example with uncertainty)
    [0.0, 0.0, 1.0],  # Class 4
])

# Map each example's class label to its target vector
target_labels = np.array([1, 2, 3, 4])  # example class labels
target_vectors = target_values[target_labels - 1]  # subtract 1 to match 0-based indexing
target_vectors

array([[1. , 0. , 0. ],
       [0. , 1. , 0. ],
       [0.2, 0.8, 0. ],
       [0. , 0. , 1. ]])

Then compile the model. You can use the BinaryCrossentropy or CategoricalCrossentropy loss with the from_logits argument set to True. Note that from_logits=True refers to the model's predictions, not the targets: it tells Keras that the model outputs raw logits (i.e., the final layer has no sigmoid/softmax activation), and the loss applies the sigmoid internally in a numerically stable way. The soft target vectors above can be passed to the loss as they are.

import tensorflow as tf

# from_logits=True, so the model's final layer should use a linear activation
cross_entropy_loss = tf.keras.losses.BinaryCrossentropy(from_logits=True)
model.compile(optimizer='adam', loss=cross_entropy_loss, metrics=['accuracy'])

# Train the model (assumed to be defined elsewhere) with the manually set target vectors
model.fit(X, target_vectors, batch_size=32, epochs=10)

I think this is what you were asking for.
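For completeness, here is a minimal sketch (my own assumptions about the surrounding model, not part of the code above) of an output layer that is consistent with from_logits=True:

import tensorflow as tf

num_features, num_classes = 20, 3   # placeholder shapes for the sketch

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(num_features,)),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(num_classes)   # no activation: the model outputs raw logits
])

model.compile(optimizer='adam',
              loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              metrics=['accuracy'])

# At inference time, apply the sigmoid yourself to get per-class confidences,
# e.g. probs = tf.sigmoid(model(X_new)) for some hypothetical input batch X_new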

Mohammad Ahmed