Context

For simplicity, let us pretend we are performing semantic segmentation on a series of one-pixel-high images of width w, with three channels (r, g, b) and n label classes.

In other words, a single image might look like:

img = [
    [r1, r2, ..., rw], # channel r
    [g1, g2, ..., gw], # channel g
    [b1, b2, ..., bw], # channel b
]

and have dimensions [3, w].

Then, for a given image with w=10 and n=3, its ground-truth labels might be:

# ground "truth"
target = np.array([
  #0     1     2     3     4     5     6     7     8     9      # position
  [0,    1,    1,    1,    0,    0,    1,    1,    1,    1],    # class 1
  [0,    0,    0,    0,    1,    1,    1,    1,    0,    0],    # class 2
  [1,    0,    0,    0,    0,    0,    0,    0,    0,    0],    # class 3
])

and our model might predict as output:

# prediction
output = np.array([
  #0     1     2     3     4     5     6     7     8     9      # position
  [0.11, 0.71, 0.98, 0.95, 0.20, 0.15, 0.81, 0.82, 0.95, 0.86], # class 1
  [0.13, 0.17, 0.05, 0.42, 0.92, 0.89, 0.93, 0.93, 0.67, 0.21], # class 2
  [0.99, 0.33, 0.20, 0.12, 0.15, 0.15, 0.20, 0.01, 0.02, 0.13], # class 3
])

For further simplicity, let us transform our model's output by binarizing it with a cutoff of 0.9:

# binary mask with cutoff 0.9
b_mask = np.array([
  #0     1     2     3     4     5     6     7     8     9      # position
  [0,    0,    1,    1,    0,    0,    0,    0,    1,    0],    # class 1
  [0,    0,    0,    0,    1,    0,    1,    1,    0,    0],    # class 2
  [1,    0,    0,    0,    0,    0,    0,    0,    0,    0],    # class 3
])
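
For reference, assuming output is the numpy array above, such a mask could be produced directly (the exact cutoff comparison does not matter for these particular values):

# hypothetical one-liner equivalent to the hand-written mask above
b_mask = (output >= 0.9).astype(int)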

Then, if we look at the "objects" of each class as bounding boxes (or, in this case, just boundaries, i.e. [start, stop] pixels), the binary mask "introduces" extra predicted objects:

# "detected" objects
p_obj = [
  [[2, 3], [8, 8]],  # class 1
  [[4, 4], [6, 7]],  # class 2
  [[0, 0]]           # class 3
] 

compared to the objects of the ground truth:

# true objects
t_obj = [
  [[1, 3], [6, 9]],  # class 1
  [[4, 7]],          # class 2
  [[0, 0]]           # class 3
] 
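
One hypothetical way to extract such [start, stop] runs from each binary row (a sketch; runs_1d is just an illustrative helper, not a standard function):

def runs_1d(row):
  # collect [start, stop] spans of consecutive 1s in a binary row
  spans, start = [], None
  for i, v in enumerate(row):
    if v and start is None:
      start = i
    elif not v and start is not None:
      spans.append([start, i - 1])
      start = None
  if start is not None:
    spans.append([start, len(row) - 1])
  return spans

p_obj = [runs_1d(row) for row in b_mask]  # [[[2, 3], [8, 8]], [[4, 4], [6, 7]], [[0, 0]]]
t_obj = [runs_1d(row) for row in target]  # [[[1, 3], [6, 9]], [[4, 7]], [[0, 0]]]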

Question

If I wanted a metric to describe the accuracy of the boundaries, on average, per object, what would be the appropriate metric?

I understand IOU in the context of training a model which predicts bounding boxes, i.e. it is an object-to-object comparison, but what should one do when one object might be fragmented into several?

Goal

I would like a metric that, per class, gives me something like this:

class 1: [-1, 2]  # bounding boxes for class one, on average start one
                  # pixel before they should and end two pixels after 
                  # they should

class 2: [ 0, 3]  # bounding boxes for class two, on average start 
                  # exactly where they should and end three pixels  
                  # after they should

class 3: [ 3, -1] # bounding boxes for class three, on average start
                  # three pixels after where they should and end one
                  # pixel too soon

but I am not sure how to best approach this when a single object is fragmented into several...

Answer

Assumption

You ask specifically about the 1D case, so we will solve the 1D case here, but the method is essentially the same for 2D.

Let us assume you have two ground truth bounding boxes: box 1 and box 2.

Further, let us assume that our model is not so great and predicts more than 2 boxes (maybe it found something new, maybe it broke one box into two).

For this demonstration let us consider that this is what we are working with:

# labels
# box 1: x----y 
# box 2: x++++y
# 0  1  2  3  4  5  6  7  8  9  10 11 12 13 14 15 16 17 18 19 20
#             x--------y        x+++++++++++++++++++++++++++++y     TRUTH
#             a-----------b                                         PRED 1, BOX 1
#                   a+++++++++++++++++b                             PRED 2, BOX 2
#                a++++++++++++++++++++++++++++++++b                 PRED 3, BOX 2

Core Problem

What you want is, in effect, a score on the alignment of your predictions to the targets... but oh no! Which targets belong to which predictions?

Pick your distance function of choice and pair each prediction with a target based on that function. In this case I will use a modified intersection over union (IOU) for the 1D case. I chose this function as I wanted both PRED 2 and 3 from the above diagram to align to box 2.

With a score for each prediction, pair it with the target that produced the best score.

Now with a one-to-one prediction-target pair, calculate whatever it is that you want.

Demo with above assumption

from the above assumptions:

pred_boxes = [
    [4,  8],
    [6, 12],
    [5, 16]
]

true_boxes = [
    [4,   7],
    [10, 20]
]

a 1d version of intersection over union:

def iou_1d(predicted_boundary, target_boundary):
  '''Calculates the intersection over union (IOU) based on a span.

  Notes:
    boundaries are provided in the form of [start, stop].
    boundaries where start = stop are accepted
    boundaries are assumed to be only in range [0, int < inf)

  Args:
    predicted_boundary (list): the [start, stop] of the predicted boundary
    target_boundary (list): the ground truth [start, stop] for which to compare

  Returns:
    iou (float): the IOU bounded in [0, 1]
  '''

  p_lower, p_upper = predicted_boundary
  t_lower, t_upper = target_boundary

  # boundaries are in form [start, stop] and 0<= start <= stop
  assert 0<= p_lower <= p_upper
  assert 0<= t_lower <= t_upper

  # no overlap, pred is too far left or pred is too far right
  if p_upper < t_lower or p_lower > t_upper:
    return 0

  if predicted_boundary == target_boundary:
    return 1

  intersection_lower_bound = max(p_lower, t_lower)
  intersection_upper_bound = min(p_upper, t_upper)

  intersection = intersection_upper_bound - intersection_lower_bound
  union = max(t_upper, p_upper) - min(t_lower, p_lower)  
  union = union if union != 0 else 1  
  return min(intersection / union, 1)
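
a couple of sanity checks with the demo boxes above:

iou_1d([4, 8], [4, 7])     # 0.75   (overlap 3 / span 4)
iou_1d([6, 12], [4, 7])    # 0.125  (overlap 1 / span 8)
iou_1d([6, 12], [10, 20])  # ~0.143 (overlap 2 / span 14)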

some simple helpers:

from math import sqrt
def euclidean(u, v):
  return sqrt((u[0]-v[0])**2 + (u[1]-v[1])**2)

def mean(arr):
  return sum(arr) / len(arr)

how we align our boundaries:

def align_1d(predicted_boundary, target_boundaries, alignment_scoring_fn=iou_1d, take=max):
  '''Aligns predicted_boundary to the closest target_boundary based on the
    alignment_scoring_fn

  Args:
    predicted_boundary (list): the predicted boundary in form of [start, stop]

    target_boundaries (list): a list of all valid target boundaries each having
      form [start, stop]

    alignment_scoring_fn (function): a function taking two arguments each of 
      which is a list of two elements, the first assumed to be the predicted
      boundary and the latter the target boundary. Should return a single number.

    take (function): should either be min or max. Selects either the highest or
      lowest score according to the alignment_scoring_fn

  Returns:
    aligned_boundary (list): the aligned boundary in form [start, stop]
  '''
  scores = [
      alignment_scoring_fn(predicted_boundary, target_boundary) 
      for target_boundary in target_boundaries
  ]

  # boundary did not align to any boxes, use fallback scoring mechanism to break
  # tie
  if not any(scores):
    scores = [
      1 / euclidean(predicted_boundary, target_boundary)
      for target_boundary in target_boundaries
    ]

  aligned_index = scores.index(take(scores))
  aligned = target_boundaries[aligned_index]
  return aligned
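
for example, [6, 12] aligns to [10, 20] (higher IOU), while a prediction with no overlap at all, such as [8, 8] against [[1, 3], [6, 9]], falls back to the inverse euclidean distance and aligns to [6, 9]:

align_1d([6, 12], true_boxes)       # [10, 20]
align_1d([8, 8], [[1, 3], [6, 9]])  # [6, 9]  (both IOUs are 0, so the fallback decides)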

how we calculate difference:

def diff(u, v):
  return [u[0] - v[0], u[1] - v[1]]

combine it all into one:

def aligned_distance_1d(predicted_boundaries, target_boundaries, alignment_scoring_fn=iou_1d, take=max, distance_fn=diff, aggregate_fn=mean):
  '''Returns the aggregated distance of predicted bounding boxes to their aligned target bounding box, based on alignment_scoring_fn and distance_fn

  Args:
    predicted_boundaries (list): a list of all predicted boundaries, each
      having form [start, stop]

    target_boundaries (list): a list of all valid target boundaries each having
      form [start, stop]

    alignment_scoring_fn (function): a function taking two arguments each of 
      which is a list of two elements, the first assumed to be the predicted
      boundary and the latter the target boundary. Should return a single number.

    take (function): should either be min or max. Selects either the highest or
      lowest score according to the alignment_scoring_fn

    distance_fn (function): a function taking two lists and should return a
      single value.

    aggregate_fn (function): a function taking a list of numbers (distances 
      calculated by distance_fn) and returns a single value (the aggregated 
      distance)

  Returns:
    aggregated_distance (list): the aggregated [start, stop] distance of the
      aligned predicted_boundaries, i.e. aggregate_fn applied per coordinate to

      [distance_fn(pair) for pair in paired_boundaries(predicted_boundaries, target_boundaries)]
  '''

  # pair each prediction with the target it aligns to best
  paired = [
      (predicted_boundary, align_1d(predicted_boundary, target_boundaries, alignment_scoring_fn))
      for predicted_boundary in predicted_boundaries
  ]
  # per-pair [start, stop] differences; aggregate starts and stops separately
  distances = [distance_fn(*pair) for pair in paired]
  aggregated = [aggregate_fn(error) for error in zip(*distances)]
  return aggregated

run:

aligned_distance_1d(pred_boxes, true_boxes)

# [-3.0, -3.6666666666666665]
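
and the pairing behind those numbers, printed with the functions above:

for p in pred_boxes:
  t = align_1d(p, true_boxes)
  print(p, '->', t, 'diff', diff(p, t))

# [4, 8] -> [4, 7] diff [0, 1]
# [6, 12] -> [10, 20] diff [-4, -8]
# [5, 16] -> [10, 20] diff [-5, -4]
# means: (0 - 4 - 5) / 3 = -3.0 for starts, (1 - 8 - 4) / 3 ≈ -3.67 for stops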

Note: for many predictions and many targets there are many ways to optimize this code. Here, I broke it up into its main functional chunks so it is clear what is going on.
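
As a sketch of one such optimization (my own assumption, not part of the metric itself), the pairwise scores can be computed in one shot with numpy broadcasting; the exact-match case and a small epsilon mirror iou_1d, though this version does not reproduce the euclidean fallback for rows with no overlap at all:

import numpy as np

def iou_matrix_1d(preds, targets):
  # preds: (P, 2), targets: (T, 2); returns a (P, T) matrix of 1D IOUs
  p = np.asarray(preds, dtype=float)
  t = np.asarray(targets, dtype=float)
  lo = np.maximum(p[:, None, 0], t[None, :, 0])
  hi = np.minimum(p[:, None, 1], t[None, :, 1])
  intersection = np.clip(hi - lo, 0, None)
  union = np.maximum(p[:, None, 1], t[None, :, 1]) - np.minimum(p[:, None, 0], t[None, :, 0])
  exact = (p[:, None, :] == t[None, :, :]).all(axis=-1)  # identical (possibly zero-length) boxes
  return np.where(exact, 1.0, intersection / np.maximum(union, 1e-9))

iou_matrix_1d(pred_boxes, true_boxes).argmax(axis=1)  # array([0, 1, 1]), same alignment as above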

Now, does this make sense? Well, since I wanted PRED 2 and 3 to align to box 2, yes: both of their starts are prior to the truth and both end prematurely.

Solution to question asked

copy pasted your examples:

# "detected" objects
p_obj = [
  [[2, 3], [8, 8]],  # class 1
  [[4, 4], [6, 7]],  # class 2
  [[0, 0]]           # class 3
] 

# true objects
t_obj = [
  [[1, 3], [6, 9]],  # class 1
  [[4, 7]],          # class 2
  [[0, 0]]           # class 3
] 

since you know the boxes per class this is easy:

[
    aligned_distance_1d(p_obj[cls_no], t_obj[cls_no])
    for cls_no in range(len(t_obj))
]


# [[1.5, -0.5], [1.0, -1.5], [0.0, 0.0]]

Does this output make sense?

Starting with a sanity check, let us look at class 3. The average distances of [start, stop] are both 0. Makes sense.

How about class 1? Both predictions start too late (2 > 1, 8 > 6) but only one ends too soon (8 < 9). So that makes sense.

Now let us look at class 2, which seems to be why you asked the question (more predictions than targets).

If we were to draw what the score suggests it would be:

#  0  1  2  3  4  5  6  7  8  9
#              ----------        # truth [4, 7]
#                 ++             # pred  [4 + 1, 7 - 1.5]

It doesn't look so great, but this is just an example...

Does this make sense? Yes and no. Yes in terms of how we calculated the metric: one prediction stopped 3 values too soon, the other started 2 too late. No in the sense that neither of your predictions actually covers the value 5, and yet this metric would lead you to believe that it does...

Conclusion

Is this a faulty metric?

That depends on what you are using it for / trying to show. However, since you use a binary mask to generate your predicted boundaries, that is a non-negligible root of this problem. Perhaps there is a better strategy to get boundaries from your label probabilities.
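
For example, on the toy data in the question, a lower per-class cutoff already helps (an illustration on that particular example, not a general recommendation):

b_mask_05 = (output >= 0.5).astype(int)  # hypothetical lower cutoff
# class 1 and class 3 now match the ground truth exactly;
# class 2 becomes the single object [4, 8] instead of [[4, 4], [6, 7]],
# overshooting the true [4, 7] by one pixel rather than fragmenting it.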
