tf.keras.losses.CategoricalCrossentropy gives different values than plain implementation

Question

Does anyone know why a raw implementation of the categorical cross-entropy function gives such different results from the tf.keras API function?

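import numpy as np
import tensorflow as tf

# TensorFlow 1.x only; on 2.x eager execution is the default and this call should be removed.
tf.enable_eager_execution()

y_true = np.array([[1., 0., 0.], [0., 1., 0.], [0., 0., 1.]])
y_pred = np.array([[.9, .05, .05], [.5, .89, .6], [.05, .01, .94]])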

ce = tf.keras.losses.CategoricalCrossentropy()
res = ce(y_true, y_pred).numpy()
print("use api:")
print(res)

print()
print("implementation:")
step1 = -y_true * np.log(y_pred )
step2 = np.sum(step1, axis=1)

print("step1.shape:", step1.shape)
print(step1)
print("sum step1:", np.sum(step1, ))
print("mean step1", np.mean(step1))

print()
print("step2.shape:", step2.shape)
print(step2)
print("sum step2:", np.sum(step2, ))
print("mean step2", np.mean(step2))

This gives:

use api:
0.3239681124687195

implementation:
step1.shape: (3, 3)
[[0.10536052 0.         0.        ]
 [0.         0.11653382 0.        ]
 [0.         0.         0.0618754 ]]
sum step1: 0.2837697356318653
mean step1 0.031529970625762814

step2.shape: (3,)
[0.10536052 0.11653382 0.0618754 ]
sum step2: 0.2837697356318653
mean step2 0.09458991187728844

Now, with a different y_true and y_pred:

y_true = np.array([[0, 1]])
y_pred = np.array([[0.99999999999, 0.00000000001]])

It gives:

use api:
16.11809539794922

implementation:
step1.shape: (1, 2)
[[-0.         25.32843602]]
sum step1: 25.328436022934504
mean step1 12.664218011467252

step2.shape: (1,)
[25.32843602]
sum step2: 25.328436022934504
mean step2 25.328436022934504

This [question](https://stackoverflow.com/q/67615051/10878733) might help. — Shubham Panchal, Jul 18 '21 at 11:05

Kaveh · Accepted Answer · 2021-07-18T12:17:09.057

The difference comes from these values: [.5, .89, .6], since their sum is not equal to 1. I think you made a mistake and meant this instead: [.05, .89, .06].

If you provide values that sum to 1, both formulas give the same results:

import tensorflow as tf
import numpy as np

y_true = np.array( [[1., 0., 0.], [0., 1., 0.], [0., 0., 1.]])
y_pred = np.array([[.9, .05, .05], [.05, .89, .06], [.05, .01, .94]])

print(tf.keras.losses.categorical_crossentropy(y_true, y_pred).numpy())
print(np.sum(-y_true * np.log(y_pred), axis=1))

#output
#[0.10536052 0.11653382 0.0618754 ]
#[0.10536052 0.11653382 0.0618754 ]

However, let's explore how the loss is calculated when the y_pred tensor is not scaled (the values do not sum to 1). If you look at the source code of categorical cross entropy, you will see that it scales y_pred so that the class probabilities of each sample sum to 1:

if not from_logits:
    # scale preds so that the class probas of each sample sum to 1
    output /= tf.reduce_sum(output,
                            reduction_indices=len(output.get_shape()) - 1,
                            keep_dims=True)

Since we passed a prediction whose probabilities do not sum to 1, let's see how this operation changes our tensor [.5, .89, .6] (axis and keepdims are the current names for reduction_indices and keep_dims):

output = tf.constant([.5, .89, .6])
output /= tf.reduce_sum(output,
                        axis=len(output.get_shape()) - 1,
                        keepdims=True)
print(output.numpy())

# array([0.2512563 , 0.44723618, 0.30150756], dtype=float32)

So the results should match if we pass the scaled y_pred from the operation above to your own implementation and the unscaled y_pred to the TensorFlow implementation:

y_true = np.array([[1., 0., 0.], [0., 1., 0.], [0., 0., 1.]])

# unscaled y_pred
y_pred = np.array([[.9, .05, .05], [.5, .89, .6], [.05, .01, .94]])
print(tf.keras.losses.categorical_crossentropy(y_true, y_pred).numpy())

# scaled y_pred (categorical_crossentropy scales the tensor above to this internally)
y_pred = np.array([[.9, .05, .05], [0.2512563, 0.44723618, 0.30150756], [.05, .01, .94]])
print(np.sum(-y_true * np.log(y_pred), axis=1))

Output:

[0.10536052 0.80466845 0.0618754 ]
[0.10536052 0.80466846 0.0618754 ]
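The class-based API in the question returns a single number rather than three per-sample values. The following is a minimal sketch, assuming the default reduction of tf.keras.losses.CategoricalCrossentropy averages the per-sample losses over the batch. It uses only the standard library, and per_sample_cce is an illustrative helper, not a TensorFlow function.

```python
import math

def per_sample_cce(y_true, y_pred):
    """Cross entropy per row after rescaling each prediction row to sum to 1.

    Illustrative helper (not part of TensorFlow); no clipping is applied here.
    """
    losses = []
    for true_row, pred_row in zip(y_true, y_pred):
        total = sum(pred_row)
        scaled = [p / total for p in pred_row]  # rescale so the row sums to 1
        losses.append(-sum(t * math.log(p) for t, p in zip(true_row, scaled)))
    return losses

y_true = [[1., 0., 0.], [0., 1., 0.], [0., 0., 1.]]
y_pred = [[.9, .05, .05], [.5, .89, .6], [.05, .01, .94]]

losses = per_sample_cce(y_true, y_pred)
mean_loss = sum(losses) / len(losses)  # assumed default reduction: batch mean

print(losses)
print(mean_loss)
```

Under that assumption, the mean should land close to the 0.3239681 reported by the API in the question, and the per-sample values close to the two output rows above.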

Now, let's explore the results of your second example. Why does it show a different output? If you check the source code again, you will see this line:

output = tf.clip_by_value(output, epsilon, 1. - epsilon)

which clips values to the range [epsilon, 1 - epsilon]. With epsilon = 1e-7, your input [0.99999999999, 0.00000000001] is converted to [0.9999999, 0.0000001] by this line, so the API gives a different result from the raw formula:

y_true = np.array([[0, 1]])
y_pred = np.array([[0.99999999999, 0.00000000001]])

print(tf.keras.losses.categorical_crossentropy(y_true, y_pred).numpy())
print(np.sum(-y_true * np.log(y_pred), axis=1))

#now let's first clip the values less than epsilon, then compare loss
epsilon=1e-7
y_pred = tf.clip_by_value(y_pred, epsilon, 1. - epsilon)
print(tf.keras.losses.categorical_crossentropy(y_true, y_pred).numpy())
print(np.sum(-y_true * np.log(y_pred), axis=1))

Output:

#results without clipping values
[16.11809565]
[25.32843602]

#results after clipping values if there is a value less than epsilon (1e-7)
[16.11809565]
[16.11809565]
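The effect of clipping can also be checked without TensorFlow. This is a rough sketch that assumes epsilon is 1e-7, as in the example above; clip is an illustrative helper standing in for tf.clip_by_value on a single value.

```python
import math

EPSILON = 1e-7  # assumption: same epsilon as in the example above

def clip(p, eps=EPSILON):
    """Clamp p into [eps, 1 - eps]; illustrative stand-in for tf.clip_by_value."""
    return min(max(p, eps), 1.0 - eps)

p_true_class = 0.00000000001  # predicted probability of the true class

raw_loss = -math.log(p_true_class)            # plain formula, no clipping
clipped_loss = -math.log(clip(p_true_class))  # formula applied after clipping

print(raw_loss)
print(clipped_loss)
```

The clipped loss is simply -log(epsilon), which is why any true-class probability below 1e-7 produces the same value of roughly 16.118.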

You got really sharp eyes and solid understanding on this. — Jason, Jul 22 '21 at 03:17
