
Does it make sense to use non-binary ground truth values with binary cross-entropy? Is there any formal proof?

It looks like this is done in practice: for example in https://blog.keras.io/building-autoencoders-in-keras.html, the MNIST images are grayscale, not binary.

Here are some code examples:

1. Normal case:

import numpy as np
import keras
from keras import backend as K


def test_1():
    print('-' * 60)

    # Binary targets, both predictions at 0.5
    y_pred = np.array([0.5, 0.5])
    y_pred = np.expand_dims(y_pred, axis=0)
    y_true = np.array([0.0, 1.0])
    y_true = np.expand_dims(y_true, axis=0)

    loss = keras.losses.binary_crossentropy(
        K.variable(y_true),
        K.variable(y_pred)
    )

    print("K.eval(loss):", K.eval(loss))

Output:

K.eval(loss): [0.6931472]

2. Non-binary ground truth values case:

def test_2():
    print('-'*60)

    # Same numbers as test_1, but with predictions and targets swapped
    y_pred = np.array([0.0, 1.0])
    y_pred = np.expand_dims(y_pred, axis=0)
    y_true = np.array([0.5, 0.5])
    y_true = np.expand_dims(y_true, axis=0)

    loss = keras.losses.binary_crossentropy(
        K.variable(y_true),
        K.variable(y_pred)
    )

    print("K.eval(loss):", K.eval(loss))

Output:

K.eval(loss): [8.01512]

3. Ground truth values outside the [0, 1] range:

def test_3():
    print('-'*60)

    y_pred = np.array([0.5, 0.5])
    y_pred = np.expand_dims(y_pred, axis=0)
    # Targets outside the [0, 1] range
    y_true = np.array([-2.0, 2.0])
    y_true = np.expand_dims(y_true, axis=0)

    loss = keras.losses.binary_crossentropy(
        K.variable(y_true),
        K.variable(y_pred)
    )

    print("K.eval(loss):", K.eval(loss))

Output:

K.eval(loss): [0.6931472]

For some reason the loss in test_1 and test_3 is the same. Maybe that is because [-2, 2] gets clipped to [0, 1], but I can't find any clipping code in the Keras source. It is also interesting that the loss values in test_1 and test_2 differ so much, even though the first case uses y_pred = [0.5, 0.5] with y_true = [0.0, 1.0] and the second uses y_pred = [0.0, 1.0] with y_true = [0.5, 0.5], i.e. the same values with the roles swapped.

In Keras, binary_crossentropy is defined as follows (first the wrapper in keras.losses, then the backend function K.binary_crossentropy that it calls):

def binary_crossentropy(y_true, y_pred):
    return K.mean(K.binary_crossentropy(y_true, y_pred), axis=-1)


def binary_crossentropy(target, output, from_logits=False):
    """Binary crossentropy between an output tensor and a target tensor.

    # Arguments
        target: A tensor with the same shape as `output`.
        output: A tensor.
        from_logits: Whether `output` is expected to be a logits tensor.
            By default, we consider that `output`
            encodes a probability distribution.

    # Returns
        A tensor.
    """
    # Note: tf.nn.sigmoid_cross_entropy_with_logits
    # expects logits, Keras expects probabilities.
    if not from_logits:
        # transform back to logits
        _epsilon = _to_tensor(epsilon(), output.dtype.base_dtype)
        output = tf.clip_by_value(output, _epsilon, 1 - _epsilon)
        output = tf.log(output / (1 - output))

    return tf.nn.sigmoid_cross_entropy_with_logits(labels=target,
                                                   logits=output)
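
For reference, here is a minimal NumPy sketch of the same computation (assuming the default Keras epsilon of 1e-7 for clipping; manual_bce is just an illustrative helper, not part of Keras):

import numpy as np

def manual_bce(y_true, y_pred, eps=1e-7):
    # Clip the predictions (not the targets), as the backend does,
    # then average -(t*log(q) + (1-t)*log(1-q)) over the last axis.
    q = np.clip(y_pred, eps, 1 - eps)
    return np.mean(-(y_true * np.log(q) + (1 - y_true) * np.log(1 - q)), axis=-1)

print(manual_bce(np.array([[0.0, 1.0]]), np.array([[0.5, 0.5]])))   # ~0.6931, matches test_1
print(manual_bce(np.array([[0.5, 0.5]]), np.array([[0.0, 1.0]])))   # ~8.06, close to test_2's 8.015 (the gap appears to be float32 rounding in the clip/logit step)
print(manual_bce(np.array([[-2.0, 2.0]]), np.array([[0.5, 0.5]])))  # ~0.6931, matches test_3 even though y_true is never clipped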

1 Answer


Yes, it "makes sense" in that the cross-entropy is a measure for the difference between probability distributions. That is, any distributions (over the same sample space of course) -- the case where the target distribution is one-hot is really just a special case, despite how often it is used in machine learning.

In general, if p is your true distribution and q is your model, cross-entropy is minimized for q = p. As such, using cross-entropy as a loss will encourage the model to converge towards the target distribution.
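As a quick numerical illustration in the Bernoulli setting used above (a hypothetical sketch, not Keras code): for a soft target t, the per-element binary cross-entropy -(t*log(q) + (1-t)*log(1-q)) is minimized exactly at q = t.

import numpy as np

t = 0.3                                # a non-binary "soft" target
q = np.linspace(0.001, 0.999, 999)     # grid of candidate predicted probabilities
bce = -(t * np.log(q) + (1 - t) * np.log(1 - q))
print(q[np.argmin(bce)])               # ~0.3: the loss is smallest where the prediction matches the target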

As for the difference between cases 1 and 2: Cross-entropy is not symmetric. It is actually equal to the entropy of the true distribution p plus the KL-divergence between p and q. This implies that it will generally be larger for p closer to uniform (less "one-hot") because such distributions have higher entropy (I suppose the KL-divergence will also be different since it's not symmetric).
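Here is a rough sketch of that decomposition for one element each from cases 1 and 2 (bernoulli_terms is just an illustrative helper; the clipping mimics the Keras epsilon):

import numpy as np

def bernoulli_terms(t, q, eps=1e-7):
    # Entropy H(p), KL(p||q) and cross-entropy H(p,q) = H(p) + KL(p||q)
    # for p = Bernoulli(t) and q = Bernoulli(q).
    t = np.clip(t, eps, 1 - eps)  # avoid log(0) for degenerate distributions
    q = np.clip(q, eps, 1 - eps)
    H = -(t * np.log(t) + (1 - t) * np.log(1 - t))
    KL = t * np.log(t / q) + (1 - t) * np.log((1 - t) / (1 - q))
    return H, KL, H + KL

print(bernoulli_terms(1.0, 0.5))  # case 1 element: H ~ 0,    KL ~ 0.69, cross-entropy ~ 0.69
print(bernoulli_terms(0.5, 1.0))  # case 2 element: H ~ 0.69, KL ~ 7.4,  cross-entropy ~ 8.06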

As for case 3: This is actually an artifact of using 0.5 as output. It turns out that in the cross-entropy formula, terms will cancel out in exactly such a way that you always get the same result (log(2)) no matter the labels. This will change when you use an output != 0.5; in this case, different labels give you different cross-entropies. For example:

  • output 0.3, target 2.0 gives cross-entropy of 2.0512707
  • output 0.3, target -2.0 gives cross-entropy of -1.3379208

The second case actually gives a negative loss value, which makes no sense. IMHO the fact that the function allows targets outside the range [0, 1] is an oversight; this should result in a crash instead. The cross-entropy formula still works mechanically, but the results are meaningless.
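
Here is a short sketch of that cancellation and of the two example values above, using the raw element-wise formula with no clipping of the target:

import numpy as np

def raw_bce(target, output):
    # Element-wise binary cross-entropy; the target is deliberately not restricted to [0, 1].
    return -(target * np.log(output) + (1 - target) * np.log(1 - output))

# With output 0.5 both log terms equal log(0.5), so the target cancels:
# -(t*log(0.5) + (1-t)*log(0.5)) = -log(0.5) = log(2) for any t.
print(raw_bce(-2.0, 0.5), raw_bce(2.0, 0.5))  # both ~0.6931
print(raw_bce(2.0, 0.3))                      # ~2.0513
print(raw_bce(-2.0, 0.3))                     # ~-1.3379, a negative "loss"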

I would also recommend reading the Wikipedia article on cross-entropy. It's quite short and has some useful information.
