18

When people tackle semantic segmentation with CNNs, they usually use a softmax cross-entropy loss during training (see, e.g., Long et al., Fully Convolutional Networks). But when it comes to comparing the performance of different approaches, measures like intersection-over-union (IoU) are reported.

My question is: why don't people train directly on the measure they want to optimize? It seems odd to me to train on one measure, but evaluate on another for benchmarks.

I can see that IoU has problems for training samples where a class is not present (union = 0 and intersection = 0, so the quotient 0/0 is undefined). But if I can ensure that every sample of my ground truth contains all classes, is there another reason not to use this measure?

zimmermc

5 Answers

17

Check out this paper, where they come up with a way to make IoU differentiable. I implemented their solution with amazing results!

mathetes
  • Might want to include some of the math here to make this not a link-only answer. Their algorithm looks similar to [Y.Wang et al](http://www.cs.umanitoba.ca/~ywang/papers/isvc16.pdf). Roughly, `I ~= sum(Y*Y')` and `U ~= sum(Y + Y' - Y*Y')`. Your paper uses the negative log of `I/U` and the one I linked uses `1-I/U`. I like the negative log form but I'm going to try both soon. Yours also sums after the I/U instead of before. – Poik Oct 11 '17 at 22:18
  • In my personal opinion, it is more physically sensible to calculate `I/U` for each sample in the training set, and then perform the summation. In this way, you evaluate the accuracy on a per-sample basis, and individual errors add up. Performing the summation first may lead to error cancellation (`sum(I)/sum(U)` may give a good score, while `sum(I/U)` may not for the same data). I am by no means an expert, though... – MPA Jan 03 '18 at 16:08
  • @mathetes I am looking into a pixelwise semantic segmentation problem as well, for binary classification. I looked through some papers, but I am unclear on some things: 1) [Mattyus et al.](https://www.cs.toronto.edu/~urtasun/publications/mattyus_etal_iccv17.pdf) shows a different soft IoU; when I tried to binarize it, it seems different from the implementation given in your paper. 2) Does a weighted cross-entropy loss achieve the same goal as a soft IoU loss? – Naman Apr 08 '18 at 02:59
  • @Poik in my case I implemented the loss in the paper you provided, and both IoU and the Dice coefficient are really high (>0.92), but the predictions are really bad, essentially constant 0s. Any suggestions? – Luis Leal Jul 04 '22 at 19:48
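For reference, the soft intersection/union from these comments (`I ~= sum(Y*Y')`, `U ~= sum(Y + Y' - Y*Y')`, loss `1 - I/U`) can be sketched in NumPy as below; this follows the per-sample-then-average variant MPA describes, and all names and the `eps` guard are illustrative, not taken from either paper:

```python
import numpy as np

def soft_iou_loss(y_pred, y_true, eps=1e-7):
    """Soft IoU loss over a batch: I ~= sum(p*t), U ~= sum(p + t - p*t).

    y_pred: predicted probabilities in [0, 1], shape (batch, ...).
    y_true: binary ground-truth masks, same shape.
    Computes I/U per sample, then averages 1 - I/U over the batch.
    """
    axes = tuple(range(1, y_pred.ndim))          # reduce over all but the batch dim
    inter = np.sum(y_pred * y_true, axis=axes)
    union = np.sum(y_pred + y_true - y_pred * y_true, axis=axes)
    iou = inter / (union + eps)                  # eps guards the 0/0 case
    return np.mean(1.0 - iou)                    # or np.mean(-np.log(iou + eps))

# Perfect predictions give a loss near 0; fully disjoint masks give 1.
```

Swapping the last line for the `-log` form gives the negative-log variant Poik mentions.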
11

It is like asking "why do we train classifiers with log loss and not accuracy?". The reason is really simple: you cannot directly train on most metrics, because they are not differentiable with respect to your parameters (or at least do not produce a nice error surface). Log loss (softmax cross-entropy) is a valid surrogate for accuracy. Now, you are completely right that it is plain wrong to train with something that is not a valid surrogate of the metric you are interested in, and the linked paper does not do a good job there, since for at least a few of the metrics they consider we could easily show a good surrogate (for weighted accuracy, for example, all you have to do is weight the log loss as well).
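As a concrete illustration of the weighted-accuracy case, a weighted log loss might be sketched as follows (NumPy; the function name, example probabilities, and class weights are all illustrative):

```python
import numpy as np

def weighted_log_loss(probs, labels, class_weights, eps=1e-12):
    """Cross-entropy where each sample is scaled by the weight of its class,
    serving as a differentiable surrogate for weighted accuracy.

    probs: predicted class probabilities, shape (n_samples, n_classes).
    labels: integer class labels, shape (n_samples,).
    class_weights: one weight per class, shape (n_classes,).
    """
    p_true = probs[np.arange(len(labels)), labels]   # probability of the true class
    w = class_weights[labels]                        # per-sample weight
    return np.sum(w * -np.log(p_true + eps)) / np.sum(w)

probs = np.array([[0.9, 0.1], [0.2, 0.8]])
labels = np.array([0, 1])
weights = np.array([1.0, 5.0])  # up-weight the rarer class
loss = weighted_log_loss(probs, labels, weights)
```

Errors on the up-weighted class dominate the loss, just as they would dominate a weighted-accuracy metric.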

lejlot
4

Here's another way to think about this in a simple manner.

Remember that it is not sufficient to simply evaluate a metric such as accuracy or IoU while solving an image problem. Evaluating the metric must also tell the network in which direction the weights should be nudged, so that it can learn effectively over iterations and epochs.

That direction is what the earlier answers mean when they say the error must be differentiable. I suspect there is nothing in the IoU metric that the network could use to say: "hey, it's not exactly here, but maybe I should move my bounding box a little to the left!"

Just a trickle of an explanation, but I hope it helps.

apil.tamang
3

I always use mean IoU for training a segmentation model; more exactly, -log(MIoU). Plain -MIoU as a loss function can easily trap your optimizer, because MIoU only spans the narrow range (0, 1). Taking the log stretches that range: -log(MIoU) grows steeply as MIoU approaches 0, which gives the optimizer a better-behaved surface to train on.
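In NumPy terms, the -log(MIoU) idea above might look like this (a sketch using a soft, probability-based IoU per class; the `eps` guard and all names are mine, not the answerer's code):

```python
import numpy as np

def neg_log_miou(y_pred, y_true, eps=1e-7):
    """-log(mean IoU) loss with a soft (probabilistic) IoU per class.

    y_pred: predicted class probabilities, shape (n_pixels, n_classes).
    y_true: one-hot ground truth, same shape.
    """
    inter = np.sum(y_pred * y_true, axis=0)                    # per-class intersection
    union = np.sum(y_pred + y_true - y_pred * y_true, axis=0)  # per-class union
    miou = np.mean(inter / (union + eps))                      # mean IoU over classes
    return -np.log(miou + eps)                                 # steep as miou -> 0
```

Near-perfect predictions give a loss near 0, while a poor MIoU is punished much more sharply than with plain -MIoU.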

Tae-Sung Shin
  • Tae-Sung Shin, thanks for the suggestion. I'm still finding that the loss goes to 0. Should the learning rate be adjusted as well? I'm using the Adam optimizer. – weather guy Dec 10 '21 at 14:45
  • The validation loss for -log(MIOU) was 0 from the get-go, and a little higher for the training loss (0.03). This is a classification problem with very high class imbalance. However, binary cross-entropy loss works fine and converges. – weather guy Dec 10 '21 at 21:23
0

The main reason is that IoU is region-based: if your true-positive threshold is 0.5, a pixel predicted with probability 0.99 counts the same as one predicted with probability 0.51. That is not ideal if we want minimizing the loss to yield a more confident model. Cross-entropy loss accounts for this difference in confidence because it operates on the probabilities themselves.
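A tiny numerical illustration of that point (assuming a single pixel whose true label is 1):

```python
import numpy as np

# Two predictions for a pixel whose true label is 1.
confident, hesitant = 0.99, 0.51

# Thresholded at 0.5, both count as the same true positive,
# so a region-based score like IoU cannot tell them apart.
assert (confident > 0.5) == (hesitant > 0.5)

# Per-pixel cross-entropy -log(p) still distinguishes them:
ce_confident = -np.log(confident)   # ~0.01
ce_hesitant = -np.log(hesitant)     # ~0.67
assert ce_confident < ce_hesitant
```

The cross-entropy gradient therefore keeps pushing the model toward higher confidence even after the thresholded prediction is already correct.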

Y YANG