0

Motivation:

I'm using scipy's python implementation of hungarian algorithm (https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.linear_sum_assignment.html#scipy-optimize-linear-sum-assignment) for matching two sets of time events. So far so good, but I have a time difference limit (let's say 0.1s) for matching these events and I don't want any matches which would have time difference above this limit.

An example without any time difference limit:

from dataclasses import dataclass
from scipy.optimize import linear_sum_assignment

@dataclass(frozen=True)
class Event:
    t: float

expected = [Event(0), Event(1), Event(2)]
detected = [          Event(1), Event(2), Event(3)] # shifted to show
                                                    # time matching

def tdiff(e1: Event, e2: Event) -> float:
    return abs(e1.t - e2.t)

cost_matrix = [
    [
        tdiff(e1, e2)
        for e2 in detected
    ]
    for e1 in expected
]
print('cost matrix:')
for row in cost_matrix:
    print('[', ', '.join(map(str, row)), ']')

row_idx, col_idx = linear_sum_assignment(cost_matrix)
for i, j in zip(row_idx, col_idx):
    exp_ev = expected[i]
    det_ev = detected[j]
    td = cost_matrix[i][j]
    print('matches %s and %s, tdiff: %.3f' % (exp_ev, det_ev, td))

This prints:

cost matrix:
[ 1, 2, 3 ]
[ 0, 1, 2 ]
[ 1, 0, 1 ]
matches Event(t=0) and Event(t=1), tdiff: 1.000
matches Event(t=1) and Event(t=2), tdiff: 1.000
matches Event(t=2) and Event(t=3), tdiff: 1.000

This is clearly not what I want, because all matches are above the limit (0.1s).

What have I tried:

My solution is to artificially increase cost of "unwanted" matching in the cost matrix, and then I filter matches to return only "wanted" matches:

from dataclasses import dataclass
from scipy.optimize import linear_sum_assignment

@dataclass(frozen=True)
class Event:
    t: float

TDIFF_MAX = 0.1
expected = [Event(0), Event(1), Event(2)]
detected = [          Event(1), Event(2), Event(3)] # shifted to show
                                                    # time matching

def tdiff(e1: Event, e2: Event) -> float:
    return abs(e1.t - e2.t)

orig_cost_matrix = [
    [
        tdiff(e1, e2)
        for e2 in detected
    ]
    for e1 in expected
]
orig_max = max(map(max, orig_cost_matrix))

print('original cost matrix:')
for row in orig_cost_matrix:
    print('[', ', '.join(map(str, row)), ']')

cost_matrix = [
    [
        x if x <= TDIFF_MAX else orig_max + 1
        for x in row
    ]
    for row in orig_cost_matrix
]
print('modified cost matrix:')
for row in cost_matrix:
    print('[', ', '.join(map(str, row)), ']')

row_idx, col_idx = linear_sum_assignment(cost_matrix)
for i, j in zip(row_idx, col_idx):
    exp_ev = expected[i]
    det_ev = detected[j]
    td = cost_matrix[i][j]
    if td <= TDIFF_MAX:
        print('matches %s and %s, tdiff: %.3f' % (exp_ev, det_ev, td))

Output:

original cost matrix:
[ 1, 2, 3 ]
[ 0, 1, 2 ]
[ 1, 0, 1 ]
modified cost matrix:
[ 4, 4, 4 ]
[ 0, 4, 4 ]
[ 4, 0, 4 ]
matches Event(t=1) and Event(t=1), tdiff: 0.000
matches Event(t=2) and Event(t=2), tdiff: 0.000

This works fine, but I don't like this solution for 2 reasons:

  1. it looks ugly and is harder to understand
  2. in "real" application the expected and detected sets can be quite large, and since time difference limit is rather small, most of the matches are "unwanted", which makes me think this is rather inefficient and wastes a lot of CPU.

Question:

Is there abetter way to "disable" some possible matches (not necessary using hungarian algoritm..), which would work better in cases with large data sets where each Event has mostly only 1, 2 or 3 possible matches?

I'm mostly interested in algorithms, so feel free to post solutions in other languages than python or even in pseudocode.

Jan Spurny
  • 5,219
  • 1
  • 33
  • 47
  • Before the `scipy` version was added, there was a Python implementation by `munkres`, https://stackoverflow.com/questions/4075669/hungarian-algorithm-in-python. It's pure python, so is relatively slow. Python isn't the best for search types of problems. – hpaulj Feb 01 '23 at 16:40
  • Initially the `scipy` code was the munkres implementation, which you can see in early scipy v 0.18 documentation and [source]. https://github.com/scipy/scipy/blob/v0.18.1/scipy/optimize/_hungarian.py#L13-L107. The current version is "builtin' - compiled, and not as easy to find. – hpaulj Feb 01 '23 at 16:50
  • It's now in the `scipy/optimize/_lsap.c` file. – hpaulj Feb 01 '23 at 17:03
  • @hpaulj that's interesting, but not really relevant to my question - the question is mostly about the algorithm, and how to modify it / use different one, not so much about scipy's implementation.. – Jan Spurny Feb 02 '23 at 12:38
  • I have a feeling this post will be buried under new questions. I noticed it because I scan [scipy], but have explored `munkres` some years ago, trying speed it up, and looking at `lisp` versions. But it doesn't look like I posted anything. [hungarian-algorithm] tag has only 99 posts in a dozen years, and not many watchers. The current scipy function should be faster, but will be hard to tweak. I don't know if there's a way or place to post your question that doesn't get buried under a lot of other python/scipy questions. – hpaulj Feb 03 '23 at 17:17

0 Answers0