
I read from the documentation:

tf.keras.losses.SparseCategoricalCrossentropy(
    from_logits=False, reduction="auto", name="sparse_categorical_crossentropy"
)

Computes the crossentropy loss between the labels and predictions.

Use this crossentropy loss function when there are two or more label classes. We expect labels to be provided as integers. If you want to provide labels using one-hot representation, please use CategoricalCrossentropy loss. There should be # classes floating point values per feature for y_pred and a single floating point value per feature for y_true.

Why is this called sparse categorical cross entropy? If anything, we are providing a more compact encoding of class labels (integers vs one-hot vectors).
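To make the two encodings concrete, here is a minimal sketch (hypothetical 3-class values, probabilities rather than logits) showing that the two losses compute the same quantity and only the label format differs:

import tensorflow as tf

# Hypothetical 3-class predictions (probabilities, so from_logits=False)
y_pred = tf.constant([[0.05, 0.90, 0.05],
                      [0.80, 0.10, 0.10]])

y_true_int = tf.constant([1, 0])                 # integer labels
y_true_onehot = tf.constant([[0., 1., 0.],
                             [1., 0., 0.]])      # the same labels, one-hot

sparse_ce = tf.keras.losses.SparseCategoricalCrossentropy()
dense_ce = tf.keras.losses.CategoricalCrossentropy()

print(sparse_ce(y_true_int, y_pred).numpy())     # ~0.164
print(dense_ce(y_true_onehot, y_pred).numpy())   # same value, ~0.164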

Josh
  • it switches from integer (compact) to one-hot (sparse) – Marco Cerliani Jun 22 '20 at 15:22
  • @MarcoCerliani it's actually the opposite thus my confusion. The sparse version takes in the true labels as integers, whereas the non-sparse one takes in true labels as one-hot encoded vectors. – Josh Jun 22 '20 at 15:24
  • Yes... it's only a question of naming... I think you understood the concept. Would you like a practical example? – Marco Cerliani Jun 22 '20 at 15:26

2 Answers


I think this is because integer encoding is more compact than one-hot encoding, and therefore better suited to storing sparse binary data. A one-hot label vector has exactly one non-zero entry, so a single integer index captures the same information with far less storage.

This can be handy when you have many possible labels (and samples), in which case a one-hot encoding can be significantly more wasteful than a simple integer per example.
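As a rough sketch of that point (made-up sizes, assuming a 10,000-class problem):

import tensorflow as tf

num_classes, batch = 10_000, 32                  # made-up sizes

y_true_int = tf.zeros([batch], dtype=tf.int32)   # one integer per example
y_true_onehot = tf.one_hot(y_true_int, depth=num_classes)

print(y_true_int.shape)      # (32,)       -> 32 values
print(y_true_onehot.shape)   # (32, 10000) -> 320,000 values, almost all zeros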

Josh

Why exactly it is called that is probably best answered by the Keras devs. However, note that this sparse cross-entropy is only suitable for "sparse labels", where exactly one value is 1 and all others are 0 (if the labels were represented as a vector rather than just an index).

On the other hand, the general CategoricalCrossentropy also works with targets that are not one-hot, i.e. any probability distribution. The values just need to be between 0 and 1 and sum to 1. This tends to be forgotten because the use case of one-hot targets is so common in current ML applications.
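As a minimal illustration (made-up numbers, not from the Keras docs):

import tensorflow as tf

# A "soft" target distribution (not one-hot) paired with a prediction
y_true = tf.constant([[0.7, 0.2, 0.1]])
y_pred = tf.constant([[0.6, 0.3, 0.1]])

cce = tf.keras.losses.CategoricalCrossentropy()
print(cce(y_true, y_pred).numpy())   # ~0.83, computed without error

# SparseCategoricalCrossentropy cannot express such a target:
# it only takes a single class index per example.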

xdurch0
  • Thanks, but I'm confused: from what I gather, both versions only accept as **true labels** for the loss a set of categorical choices that are either one-hot encoded or integers. This would make sense to me as they are **categorical** cross entropies after all. However, are you saying that, in fact, the `CategoricalCrossEntropy` loss function accepts vectors with real positive values for the **true label** argument? – Josh Jun 22 '20 at 15:27
  • Yes -- it measures the cross entropy between two categorical probability distributions. Such distributions assign _probabilities_ to a set of discrete categories, but probabilities can be between 0 and 1 -- they don't have to be hard 0s and 1s. As such, the `label` can be _any categorical probability distribution_ (just like the predictions are usually soft probabilities, gotten from a softmax layer). Check equation 1 on the Wiki page: https://en.wikipedia.org/wiki/Cross_entropy No need for p or q to be one-hot! – xdurch0 Jun 22 '20 at 16:10
  • This answer is wrong. As the Keras [docs](https://keras.io/api/losses/probabilistic_losses/#sparsecategoricalcrossentropy-class) point out, `SparseCategoricalCrossentropy` expects labels to be provided as integers. This is not a sparse representation of the data, hence the OP's question. Also, as the [docs](https://keras.io/api/losses/probabilistic_losses/#categoricalcrossentropy-class) further indicate, `CategoricalCrossentropy` expects targets to be one-hot encoded. You cannot pass targets with values in the range 0-1. It may not be mathematically necessary but Keras requires this. – codeananda Dec 04 '20 at 08:58
  • 1. Actually, providing integers of those positions that are non-0 is _exactly_ what a sparse representation is. 2. Yes, `CategoricalCrossentropy` works on any probability distribution, not just one-hot targets. Have you actually _tried_ it? Note we are talking about `tf.keras` here, I don't know about "standalone" Keras. – xdurch0 Dec 04 '20 at 11:19