
I am taking a machine learning course in which I have to implement the forward and backward methods of CELoss:

class CELoss(object):
    @staticmethod
    def forward(x, y):
        assert len(x.shape) == 2 # x is batch of predictions   (batch_size, 10)
        assert len(y.shape) == 1 # y is batch of target labels (batch_size,)
        # TODO implement cross entropy loss averaged over batch
        return


    @staticmethod
    def backward(x, y, dout):
        # TODO implement dx
        dy = 0.0 # no useful gradient for y, just set it to zero
        return dx, dy

Moreover, I am given the CELoss as

CELoss(x, y) = -\log\frac{\exp(x_y)}{\sum_{k}\exp(x_k)}

(it says I cannot use the formula editor because I need at least 10 reputation)

This, however, is not the CELoss that you can find on Wikipedia, for example (https://en.wikipedia.org/wiki/Cross_entropy). From my understanding, the cross-entropy loss takes targets and predictions. Does x represent the targets here and y the predictions? If so, what is x_y referring to? Thank you for your help!

spadel

1 Answer


They are the same.

The cross-entropy loss that you give in your question corresponds to the particular case of cross-entropy where the labels are one-hot, i.e. each probability is either 1 or 0, which I assume is the case if you're doing basic classification.

As to why this happens, let's start with the cross-entropy loss for a single training example x:

Loss = - sum_j P(x_j) log(Q(x_j)) #j is the index of possible labels 

where P is the "true" distribution and "Q" is the distribution that your network has learned. The "true" distribution P is given by your hard labels, that is, assuming that the true label is t, you'll have:

P(x_t) = 1
P(x_j) = 0   if j!=t   

which means that the loss above becomes

Loss = -log(Q_t)
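In other words, the one-hot P simply picks out the term for the true label. Here is a quick numerical sanity check of that collapse, with made-up numbers:

import numpy as np

Q = np.array([0.7, 0.2, 0.1])      # some predicted distribution over 3 labels
P = np.array([0.0, 1.0, 0.0])      # one-hot "true" distribution, true label t = 1
full_ce = -np.sum(P * np.log(Q))   # generic cross-entropy
collapsed = -np.log(Q[1])          # -log(Q_t)
assert np.isclose(full_ce, collapsed)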

In your case, it seems that the distribution Q is computed from the logits, i.e. the outputs of the last layer before the softmax or cost function, which give one score per label:

scores= [s_1 , ..., s_N]

if you run that through a softmax, you get:

distribution = [exp(s_1)/(sum_k exp(s_k)), ..., exp(s_N)/(sum_k exp(s_k))]
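For concreteness, here is a minimal numpy sketch of that softmax step (the score values are made up; subtracting the max is only a standard numerical-stability trick and does not change the result):

import numpy as np

scores = np.array([2.0, 1.0, 0.1])            # s_1, ..., s_N for one example
exp_scores = np.exp(scores - scores.max())    # shift by the max for numerical stability
distribution = exp_scores / exp_scores.sum()  # one probability per label, sums to 1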

The probability that Q assigns to the true label t is thus given by

Q(s_t)=exp(s_t)/(sum_k exp(s_k))

and this brings us back to the loss which can be expressed as

Loss = -log(Q_t) = -log(exp(s_t)/(sum_k exp(s_k)))

which is the one you've given in your problem. In your question, x_y is therefore the score that the network outputs for the correct label y associated with the input x.
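In case it helps, here is one way the skeleton from your question could be filled in with numpy, consistent with the formula above. This is only a sketch, not necessarily your course's reference solution; the backward pass uses the standard softmax-cross-entropy gradient, softmax(x) minus the one-hot encoding of y, divided by the batch size because the forward pass averages over the batch.

import numpy as np

class CELoss(object):
    @staticmethod
    def forward(x, y):
        assert len(x.shape) == 2  # (batch_size, num_classes)
        assert len(y.shape) == 1  # (batch_size,)
        shifted = x - x.max(axis=1, keepdims=True)  # numerical stability, same result
        log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
        # -log Q(x_y) for each example, averaged over the batch
        return -log_probs[np.arange(x.shape[0]), y].mean()

    @staticmethod
    def backward(x, y, dout):
        shifted = x - x.max(axis=1, keepdims=True)
        probs = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)
        probs[np.arange(x.shape[0]), y] -= 1.0  # softmax(x) - one_hot(y)
        dx = dout * probs / x.shape[0]          # divide by batch size for the average
        dy = 0.0  # no useful gradient for y
        return dx, dy

You can always double-check the backward pass numerically by comparing dx against finite differences of forward.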

Ash