How is total loss calculated over multiple classes in Keras?

Question

Let's say I have a network with the following parameters:

fully convolutional network for semantic segmentation
loss = weighted binary cross entropy (but it could be any loss function, doesn't matter)
5 classes - inputs are images and ground truths are binary masks
Batch size = 16

Now, I know that the loss is calculated in the following manner: binary cross entropy is applied to each pixel in the image with regard to each class. So essentially, each pixel will have 5 loss values.

What happens after this step?

When I train my network, it prints only a single loss value for an epoch. There are many levels of loss accumulation that need to happen to produce a single value and how it happens is not clear at all in the docs/code.

What gets combined first: (1) the loss values of the classes (for instance, the 5 values, one per class, get combined per pixel) and then all the pixels in the image, or (2) all the pixels in the image for each individual class, and then all the class losses?
How exactly are these different pixel combinations happening - where is it being summed / where is it being averaged?
Keras's binary_crossentropy averages over axis=-1. So is this an average of all the pixels per class, an average of all the classes, or both?

To state it in a different way: how are the losses for different classes combined to produce a single loss value for an image?

This is not explained in the docs at all and would be very helpful for people doing multi-class predictions on keras, regardless of the type of network. Here is the link to the start of keras code where one first passes in the loss function.

The closest thing I could find to an explanation is

loss: String (name of objective function) or objective function. See losses. If the model has multiple outputs, you can use a different loss on each output by passing a dictionary or a list of losses. The loss value that will be minimized by the model will then be the sum of all individual losses

from keras. So does this mean that the losses for each class in the image is simply summed?

Example code here for someone to try it out. Here's a basic implementation borrowed from Kaggle and modified for multi-label prediction:

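# Imports needed by the snippet below (standalone Keras with the TensorFlow backend)
import tensorflow as tf
from keras import backend as K
from keras.models import Model
from keras.layers import Input, Lambda, Conv2D, MaxPooling2D, Conv2DTranspose, concatenate

# Build U-Net model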
num_classes = 5
IMG_DIM = 256
IMG_CHAN = 3
weights = {0: 1, 1: 1, 2: 1, 3: 1, 4: 1000} #chose an extreme value just to check for any reaction
inputs = Input((IMG_DIM, IMG_DIM, IMG_CHAN))
s = Lambda(lambda x: x / 255) (inputs)

c1 = Conv2D(8, (3, 3), activation='relu', padding='same') (s)
c1 = Conv2D(8, (3, 3), activation='relu', padding='same') (c1)
p1 = MaxPooling2D((2, 2)) (c1)

c2 = Conv2D(16, (3, 3), activation='relu', padding='same') (p1)
c2 = Conv2D(16, (3, 3), activation='relu', padding='same') (c2)
p2 = MaxPooling2D((2, 2)) (c2)

c3 = Conv2D(32, (3, 3), activation='relu', padding='same') (p2)
c3 = Conv2D(32, (3, 3), activation='relu', padding='same') (c3)
p3 = MaxPooling2D((2, 2)) (c3)

c4 = Conv2D(64, (3, 3), activation='relu', padding='same') (p3)
c4 = Conv2D(64, (3, 3), activation='relu', padding='same') (c4)
p4 = MaxPooling2D(pool_size=(2, 2)) (c4)

c5 = Conv2D(128, (3, 3), activation='relu', padding='same') (p4)
c5 = Conv2D(128, (3, 3), activation='relu', padding='same') (c5)

u6 = Conv2DTranspose(64, (2, 2), strides=(2, 2), padding='same') (c5)
u6 = concatenate([u6, c4])
c6 = Conv2D(64, (3, 3), activation='relu', padding='same') (u6)
c6 = Conv2D(64, (3, 3), activation='relu', padding='same') (c6)

u7 = Conv2DTranspose(32, (2, 2), strides=(2, 2), padding='same') (c6)
u7 = concatenate([u7, c3])
c7 = Conv2D(32, (3, 3), activation='relu', padding='same') (u7)
c7 = Conv2D(32, (3, 3), activation='relu', padding='same') (c7)

u8 = Conv2DTranspose(16, (2, 2), strides=(2, 2), padding='same') (c7)
u8 = concatenate([u8, c2])
c8 = Conv2D(16, (3, 3), activation='relu', padding='same') (u8)
c8 = Conv2D(16, (3, 3), activation='relu', padding='same') (c8)

u9 = Conv2DTranspose(8, (2, 2), strides=(2, 2), padding='same') (c8)
u9 = concatenate([u9, c1], axis=3)
c9 = Conv2D(8, (3, 3), activation='relu', padding='same') (u9)
c9 = Conv2D(8, (3, 3), activation='relu', padding='same') (c9)

outputs = Conv2D(num_classes, (1, 1), activation='sigmoid') (c9)

# BCE_loss, dice_coef and mean_iou are assumed to be defined elsewhere
def weighted_loss(weightsList):
    def lossFunc(true, pred):

        axis = -1 #if channels last
        #axis=  1 #if channels first
        # index of the active class for every pixel
        classSelectors = K.argmax(true, axis=axis)
        # one boolean selector per class, cast to float
        classSelectors = [K.equal(tf.cast(i, tf.int64), tf.cast(classSelectors, tf.int64)) for i in range(len(weightsList))]
        classSelectors = [K.cast(x, K.floatx()) for x in classSelectors]
        weights = [sel * w for sel,w in zip(classSelectors, weightsList)]

        # sum the per-class weight maps into a single per-pixel weight
        weightMultiplier = weights[0]
        for i in range(1, len(weights)):
            weightMultiplier = weightMultiplier + weights[i]

        loss = BCE_loss(true, pred) - (1+dice_coef(true, pred))
        loss = loss * weightMultiplier
        return loss
    return lossFunc

# weighted_loss has to be defined before it is passed to compile()
model = Model(inputs=[inputs], outputs=[outputs])
model.compile(optimizer='adam', loss=weighted_loss(weights), metrics=[mean_iou])
model.summary()

The actual BCE-DICE loss function can be found here.
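As a rough illustration of what the weightMultiplier in weighted_loss works out to, here is a small NumPy sketch (not the Keras code itself). The tiny 1x3 "image" and the names class_idx and weight_map are made up for the example; the weights are the ones from the snippet above.

```python
import numpy as np

weights_list = [1, 1, 1, 1, 1000]

# One image of 1x3 pixels with 5 class channels (channels last).
true = np.array([[[[0, 0, 0, 0, 1],      # pixel belongs to class 4
                   [1, 0, 0, 0, 0],      # pixel belongs to class 0
                   [0, 0, 0, 0, 0]]]],   # pixel belongs to no class
                dtype=float)

# argmax over the class axis gives one class index per pixel.
class_idx = true.argmax(axis=-1)                      # shape (1, 1, 3)

# Looking up the weight of that class is equivalent to summing
# selector * weight over all classes, as the loop in lossFunc does.
weight_map = np.asarray(weights_list, dtype=float)[class_idx]

print(weight_map.shape)
print(weight_map)
```

Note that a pixel whose mask is zero in every channel gets the weight of class 0, because argmax returns the first index when all values are equal.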

Motivation for the question: Based on the above code, the total validation loss of the network after 20 epochs is ~1%; however, the mean intersection over union scores for the first 4 classes are above 95% each, but for the last class it's 23%, which clearly indicates that the 5th class isn't doing well at all. However, this drop in accuracy isn't reflected in the loss at all. That suggests the individual losses for a sample are being combined in a way that hides the large loss for the 5th class, so when the per-sample losses are combined over the batch, the result is still really low. I'm not sure how to reconcile this information.

As far as I know the loss function is calculated on the whole batch and returns a tensor of dimension (batch_size,). So before keras does its behind the curtain magic, (which I assume is simply averaging over the batch_dimension) you should have 16 and not 10 losses from your loss function. — dennis-w, Aug 27 '18 at 08:26
Right, I agree with that as well. So, then that must mean the loss from the different classes are in some way combined per image to produce a single loss value per image. How are these loss values combined? @dennis-ec — Jonathan, Aug 27 '18 at 08:46
Using K.mean( . . ., axis=-1) as you can see here: https://github.com/keras-team/keras/blob/master/keras/losses.py — dennis-w, Aug 27 '18 at 08:49
That line confuses me because for something like BCE being applied to semantic segmentation, it means that BCE loss is applied to each pixel, and then the loss for all the pixels is averaged to produce a single loss per class. This is what I thought that `K.mean(...)` line is referring to. So is it that it's being averaged over all the pixels and then again over all the classes? @dennis-ec — Jonathan, Aug 27 '18 at 08:52
Perhaps another way to think about it is what if I just chose to use a custom loss function, let's say intersection over union. This loss produces a single value per class per image. So, then how does the loss value for IoU get combined over all the classes? @dennis-ec — Jonathan, Aug 27 '18 at 08:55
I don't know what you are meaning with pixel. I assumed your network output has the shape (16, 10). Your labels should have the same shape so K.binary_crossentropy gives you per image per classes losses which are then averaged by K.mean to per image losses. And my assumption is that this resulting tensor will also be averaged outside of keras.losses to get your single loss value. — dennis-w, Aug 27 '18 at 09:01
I am running an FCN for semantic segmentation, which means I'm doing image segmentation using a neural network. So, every pixel becomes an input and the output is a prediction of what class every pixel is. Does that make sense? So my network output (if my input image is RGB - 512x512x3) is 512x512x10 if I have 10 classes. — Jonathan, Aug 27 '18 at 09:03

today · Accepted Answer · 2018-09-08T19:50:11.770

Although I have already mentioned part of this answer in a related answer, let's inspect the source code step by step in more detail to find the answer concretely.

First, let's feedforward(!): there is a call to the weighted_loss function, which takes y_true, y_pred, sample_weight and mask as inputs:

weighted_loss = weighted_losses[i]
# ...
output_loss = weighted_loss(y_true, y_pred, sample_weight, mask)

weighted_loss is actually an element of a list which contains all the (augmented) loss functions passed to the compile method:

weighted_losses = [
    weighted_masked_objective(fn) for fn in loss_functions]

The "augmented" word I mentioned is important here. That's because, as you can see above, the actual loss function is wrapped by another function called weighted_masked_objective which has been defined as follows:

def weighted_masked_objective(fn):
    """Adds support for masking and sample-weighting to an objective function.
    It transforms an objective function `fn(y_true, y_pred)`
    into a sample-weighted, cost-masked objective function
    `fn(y_true, y_pred, weights, mask)`.
    # Arguments
        fn: The objective function to wrap,
            with signature `fn(y_true, y_pred)`.
    # Returns
        A function with signature `fn(y_true, y_pred, weights, mask)`.
    """
    if fn is None:
        return None

    def weighted(y_true, y_pred, weights, mask=None):
        """Wrapper function.
        # Arguments
            y_true: `y_true` argument of `fn`.
            y_pred: `y_pred` argument of `fn`.
            weights: Weights tensor.
            mask: Mask tensor.
        # Returns
            Scalar tensor.
        """
        # score_array has ndim >= 2
        score_array = fn(y_true, y_pred)
        if mask is not None:
            # Cast the mask to floatX to avoid float64 upcasting in Theano
            mask = K.cast(mask, K.floatx())
            # mask should have the same shape as score_array
            score_array *= mask
            #  the loss per batch should be proportional
            #  to the number of unmasked samples.
            score_array /= K.mean(mask)

        # apply sample weighting
        if weights is not None:
            # reduce score_array to same ndim as weight array
            ndim = K.ndim(score_array)
            weight_ndim = K.ndim(weights)
            score_array = K.mean(score_array,
                                 axis=list(range(weight_ndim, ndim)))
            score_array *= weights
            score_array /= K.mean(K.cast(K.not_equal(weights, 0), K.floatx()))
        return K.mean(score_array)
    return weighted
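The reduction steps in the wrapper above can be mirrored in plain NumPy to see what each branch does to the shape of score_array. This is only a sketch under that assumption: weighted_reduce is a hypothetical helper written for this illustration, not a Keras function, and the score values are made up.

```python
import numpy as np

def weighted_reduce(score_array, weights=None, mask=None):
    """NumPy mirror of the reduction logic in `weighted` (illustration only)."""
    score_array = np.asarray(score_array, dtype=float)
    if mask is not None:
        mask = np.asarray(mask, dtype=float)
        # Zero out masked entries and rescale by the unmasked fraction.
        score_array = score_array * mask / mask.mean()
    if weights is not None:
        weights = np.asarray(weights, dtype=float)
        # Average away every axis that the weights tensor does not have.
        extra_axes = tuple(range(weights.ndim, score_array.ndim))
        if extra_axes:
            score_array = score_array.mean(axis=extra_axes)
        score_array = score_array * weights
        # Rescale by the fraction of samples with a non-zero weight.
        score_array = score_array / np.mean(weights != 0)
    # Final step: one scalar for the whole batch.
    return score_array.mean()

# Per-pixel scores for 2 images of 3x3 pixels (values 0..17).
scores = np.arange(2 * 3 * 3, dtype=float).reshape(2, 3, 3)

unweighted = weighted_reduce(scores)
uniform = weighted_reduce(scores, weights=np.ones(2))
first_only = weighted_reduce(scores, weights=np.array([1.0, 0.0]))
print(unweighted, uniform, first_only)
```

With per-sample weights of shape (batch_size,), the spatial axes are averaged first, so each image contributes a single number before the weights are applied. Uniform weights therefore give the same scalar as no weights at all.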

So, there is a nested function, weighted, that actually calls the real loss function fn in the line score_array = fn(y_true, y_pred). Now, to be concrete, in case of the example the OP provided, the fn (i.e. loss function) is binary_crossentropy. Therefore we need to take a look at the definition of binary_crossentropy() in Keras:

def binary_crossentropy(y_true, y_pred):
    return K.mean(K.binary_crossentropy(y_true, y_pred), axis=-1)

which in turn, calls the backend function K.binary_crossentropy(). In case of using Tensorflow as the backend, the definition of K.binary_crossentropy() is as follows:

def binary_crossentropy(target, output, from_logits=False):
    """Binary crossentropy between an output tensor and a target tensor.
    # Arguments
        target: A tensor with the same shape as `output`.
        output: A tensor.
        from_logits: Whether `output` is expected to be a logits tensor.
            By default, we consider that `output`
            encodes a probability distribution.
    # Returns
        A tensor.
    """
    # Note: tf.nn.sigmoid_cross_entropy_with_logits
    # expects logits, Keras expects probabilities.
    if not from_logits:
        # transform back to logits
        _epsilon = _to_tensor(epsilon(), output.dtype.base_dtype)
        output = tf.clip_by_value(output, _epsilon, 1 - _epsilon)
        output = tf.log(output / (1 - output))

    return tf.nn.sigmoid_cross_entropy_with_logits(labels=target,
                                                   logits=output)

The tf.nn.sigmoid_cross_entropy_with_logits returns:

A Tensor of the same shape as logits with the componentwise logistic losses.
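To see that nothing is reduced at this stage, the backend steps above can be reproduced in NumPy. This is a sketch, not the TensorFlow implementation: bce_from_probs is a hypothetical name, eps=1e-7 is assumed to match the Keras default epsilon, and the stable formula is the one given in the tf.nn.sigmoid_cross_entropy_with_logits documentation.

```python
import numpy as np

def bce_from_probs(target, output, eps=1e-7):
    """Element-wise binary cross-entropy computed via logits (illustration only)."""
    # Clip probabilities and transform them back to logits, as the backend does.
    output = np.clip(output, eps, 1 - eps)
    logits = np.log(output / (1 - output))
    # Numerically stable form: max(x, 0) - x * z + log(1 + exp(-|x|))
    return (np.maximum(logits, 0) - logits * target
            + np.log1p(np.exp(-np.abs(logits))))

target = np.array([[1.0, 0.0], [0.0, 1.0]])
output = np.array([[0.9, 0.2], [0.6, 0.3]])

losses = bce_from_probs(target, output)

# Textbook definition, for comparison.
direct = -(target * np.log(output) + (1 - target) * np.log(1 - output))

print(losses.shape)
print(np.allclose(losses, direct))
```

The result has exactly the shape of the inputs, one loss value per element, so for a segmentation output there is one value per pixel per class.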

Now, let's backpropagate(!): considering the above note, the output shape of K.binary_crossentropy would be the same as y_pred (or y_true). As the OP mentioned, y_true has a shape of (batch_size, img_dim, img_dim, num_classes). Therefore, K.mean(..., axis=-1) is applied over a tensor of shape (batch_size, img_dim, img_dim, num_classes), which results in an output tensor of shape (batch_size, img_dim, img_dim). So the loss values of all classes are averaged for each pixel in the image. Hence, the shape of score_array in the weighted function mentioned above would be (batch_size, img_dim, img_dim).

There is one more step: the return statement in the weighted function takes the mean again, i.e. return K.mean(score_array). So how does it compute the mean? If you take a look at the definition of the mean backend function, you will find that the axis argument is None by default:

def mean(x, axis=None, keepdims=False):
    """Mean of a tensor, alongside the specified axis.
    # Arguments
        x: A tensor or variable.
        axis: A list of integer. Axes to compute the mean.
        keepdims: A boolean, whether to keep the dimensions or not.
            If `keepdims` is `False`, the rank of the tensor is reduced
            by 1 for each entry in `axis`. If `keepdims` is `True`,
            the reduced dimensions are retained with length 1.
    # Returns
        A tensor with the mean of elements of `x`.
    """
    if x.dtype.base_dtype == tf.bool:
        x = tf.cast(x, floatx())
    return tf.reduce_mean(x, axis, keepdims)

It calls tf.reduce_mean(), which, given axis=None, takes the mean over all the axes of the input tensor and returns one single value. Therefore, the mean of the whole tensor of shape (batch_size, img_dim, img_dim) is computed, which translates to taking the average over all the labels in the batch and over all their pixels, and is returned as one single scalar value which represents the loss value. Then, this loss value is reported back by Keras and is used for optimization.
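To make the two reduction steps concrete, here is a minimal NumPy sketch (not Keras code). The per-element loss values are made up, with the last class deliberately much worse than the others, and the sketch assumes no mask and uniform sample weights.

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size, img_dim, num_classes = 2, 4, 5

# Made-up element-wise BCE values, shape (batch, H, W, classes).
elementwise = rng.uniform(0.0, 0.1,
                          size=(batch_size, img_dim, img_dim, num_classes))
elementwise[..., 4] = 2.0   # the 5th class is predicted badly everywhere

# Step 1: the loss function averages over the class axis (axis=-1).
per_pixel = elementwise.mean(axis=-1)          # shape (batch, H, W)

# Step 2: the wrapper averages over everything that is left.
total = per_pixel.mean()                       # scalar

# For reference: the mean loss of each class on its own.
per_class = elementwise.mean(axis=(0, 1, 2))   # shape (classes,)

print(per_pixel.shape)
print(per_class)
print(total)
```

Under these assumptions the two-step mean equals the plain mean over every element, which is also the unweighted mean of the per-class losses. A single badly predicted class therefore contributes only 1/num_classes of its own mean loss to the reported value.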

Bonus: what if our model has multiple output layers and therefore multiple loss functions are used?

Remember the first piece of code I mentioned in this answer:

weighted_loss = weighted_losses[i]
# ...
output_loss = weighted_loss(y_true, y_pred, sample_weight, mask)

As you can see there is an i variable which is used for indexing the array. You may have guessed correctly: it is actually part of a loop which computes the loss value for each output layer using its designated loss function and then takes the (weighted) sum of all these loss values to compute the total loss:

# Compute total loss.
total_loss = None
with K.name_scope('loss'):
    for i in range(len(self.outputs)):
        if i in skip_target_indices:
            continue
        y_true = self.targets[i]
        y_pred = self.outputs[i]
        weighted_loss = weighted_losses[i]
        sample_weight = sample_weights[i]
        mask = masks[i]
        loss_weight = loss_weights_list[i]
        with K.name_scope(self.output_names[i] + '_loss'):
            output_loss = weighted_loss(y_true, y_pred,
                                        sample_weight, mask)
        if len(self.outputs) > 1:
            self.metrics_tensors.append(output_loss)
            self.metrics_names.append(self.output_names[i] + '_loss')
        if total_loss is None:
            total_loss = loss_weight * output_loss
        else:
            total_loss += loss_weight * output_loss
    if total_loss is None:
        if not self.losses:
            raise ValueError('The model cannot be compiled '
                                'because it has no loss to optimize.')
        else:
            total_loss = 0.

    # Add regularization penalties
    # and other layer-specific losses.
    for loss_tensor in self.losses:
        total_loss += loss_tensor
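The arithmetic of that loop can be sketched with plain numbers. The output names, loss values, loss weights and regularization penalty below are all hypothetical and chosen only to show how the pieces add up.

```python
# Hypothetical scalar losses already reduced per output layer.
output_losses = {"mask_out": 0.30, "edge_out": 0.10}

# Hypothetical loss_weights as they would be passed to compile().
loss_weights = {"mask_out": 1.0, "edge_out": 0.5}

# Hypothetical regularization penalties collected from the layers.
regularization_losses = [0.02]

# Weighted sum over the outputs, then the layer-specific losses on top.
total_loss = sum(loss_weights[name] * value
                 for name, value in output_losses.items())
total_loss += sum(regularization_losses)

print(total_loss)
```

So the "sum of all individual losses" mentioned in the documentation refers to summing over output layers, not over classes within a single output.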

Thank you for your answer. A couple of questions: if a tensor of shape `(batch_size, img_dim, img_dim, num_classes)` is being passed in, then how am I able to multiply the loss by a weight (see `weighted_loss()` in the code I posted)? And secondly, can you please elaborate on "loss values of all classes are averaged for each pixel in the image"? Does that mean the pixels are first averaged per class, then these are again averaged for all the classes? — Jonathan, Sep 06 '18 at 19:14
Also, to confirm, if `(batch_size, img_dim, img_dim, num_classes)` is being passed into the loss function and I decide to use only `dice_coef = (2*intersection)/union` for loss, that would mean the intersection & union are calculated for all the classes at the same time, correct? In essence, when I calculate the intersection - when I'm doing `pred * true`, I'm actually doing `(batch_size, img_dim, img_dim, num_classes) * (batch_size, img_dim, img_dim, num_classes)`. — Jonathan, Sep 06 '18 at 19:18
@Jonathan Note that your custom loss function **can** return one single value. It is not necessary to return a 2D or 3D tensor as the loss. But keep in mind that `return K.mean(score_array)` line is executed and no matter what the shape of the output of your custom loss function (i.e. `score_array`) is, it takes an average of all the elements in that and returns **one single value**. Now, if you would like to control and customize the averaging and weighting operations, you need to perform all of these in your custom loss function (i.e. `weighted_loss`) and return one single value. >>> — today, Sep 06 '18 at 19:44
@Jonathan >> This way you are making sure that the loss is computed the way you intend it to be computed and therefore you have full control over the computation of loss value. — today, Sep 06 '18 at 19:44
@Jonathan "Does that mean the pixels are first averaged per class, ...": No, in that stage, the loss values for each individual pixel which has a shape of `(1, 1, 1, num_classes)` are averaged. Therefore, the output shape of the result in that stage would be `(batch_size, img_dim, img_dim)`. Essentially the last axis (i.e. classes axis) is removed since the averaging was done over that axis. And now each pixel in the resulting tensor contains the average of loss values computed over all the classes in that pixel. — today, Sep 06 '18 at 19:54
I see, so after `binary_crossentropy()` returns a tensor of size `(batch_size, img_dim, img_dim)`, this is averaged again by `return K.mean(score_array)`. So technically speaking all the pixels in that entire tensor are summed and then divided by `batch_size x img_dim x img_dim`. Correct? — Jonathan, Sep 06 '18 at 20:15

BlueKryptonite · Answer 2 · 2018-08-30T22:20:25.890

1) What gets combined first - (1) the loss values of the class (for instance 10 values (one for each class) get combined per pixel) and then all the pixels in the image or (2) all the pixels in the image for each individual class, then all the class losses are combined?
2) How exactly are these different pixel combinations happening - where is it being summed / where is it being averaged?

My answer for (1): When training a batch of images, an array consisting of pixel values is trained by calculating the non-linear function, loss and optimizing (updating the weights). The loss is not calculated for each pixel value; rather, it is done for each image.

The pixel values (X_train), weights and bias (b) are used in a sigmoid (for the simplest example of non-linearity) to calculate the predicted y value. This, along with the y_train (a batch at a time) is used to calculate the loss, which is optimized using one of the optimization methods like SGD, momentum, Adam, etc to update the weights and biases.

My answer for (2): During the non-linearity operation, the pixel values (X_train) are combined with the weights (through a dot product) and added to bias to form a predicted target value.

In a batch, there may be training examples belonging to different classes. The corresponding target values (for each class) are compared with the corresponding predicted values to compute the loss. Therefore, it is perfectly fine to sum all the losses.

It really doesn't matter if they belong to one class or multiple classes as long as you compare it with a corresponding target of the correct class. Make sense?

This answer is wrong when it comes to FCNs. What you say is true for a regular CNN, but this is in fact pixel wise prediction to produce semantic segmentation. Therefore the loss is in fact applied at a pixel level. And all the other information you stated is just general knowledge and doesn't have much to do with the question. — Jonathan, Aug 31 '18 at 18:43
The essential idea is not too far different from CNN. Nevertheless, you might find this article useful to get a basic idea of the concept. https://medium.com/100-shades-of-machine-learning/https-medium-com-100-shades-of-machine-learning-rediscovering-semantic-segmentation-part1-83e1462e0805 — BlueKryptonite, Sep 01 '18 at 03:14
