Python: Select a single minimum-distance pair based not only on values, but also on other participants' minimum-distance pairs

Question

I have two lists of entities, say Names and Emails, and a function that measures the distance between a Name and an Email. Below, for each Email, are the measured distances to each Name.

    1@ - {A:0.2, B:0.3, C:0.4, D:0.6}
    2@ - {A:0.15, B:0.2, C:0.2, D:0.5}
    3@ - {A:0.1, B:0.05, C:0.03, D:0.2}

Now I want to find a single minimum-distance Name for each Email. However, if two Emails share the same minimum-distance Name candidate, the Email with the smaller distance wins. The other Email should then fall back to its second-closest Name candidate and be checked again.

So in this case the result should be:

    1@: B  
    2@: A
    3@: C

Table to explain:

    emails/names  A     B     C     D
    1@            0.2   0.3   0.4   0.6
    2@            0.15  0.2   0.2   0.5
    3@            0.1   0.05  0.03  0.2

Speed is important. The data can be processed as a dataframe or as dicts; either is fine.
Thanks for any help.

UPD:

The number of Emails may exceed the number of Names, in which case some Emails will be left unassigned. I also need to catch those.

Andrej Kesely · Accepted Answer · 2021-04-11T22:21:03.653

Supposing you have this dataframe:

  emails/names     A     B     C    D
0           1@  0.20  0.30  0.40  0.6
1           2@  0.15  0.20  0.20  0.5
2           3@  0.10  0.05  0.03  0.2
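For anyone who wants to reproduce this, one way to build that dataframe is sketched below. It assumes pandas is installed and that the label column is literally named "emails/names", as in the printout above.

```python
import pandas as pd

# Distances from the question: one row per email, one column per name.
df = pd.DataFrame(
    {
        "emails/names": ["1@", "2@", "3@"],
        "A": [0.20, 0.15, 0.10],
        "B": [0.30, 0.20, 0.05],
        "C": [0.40, 0.20, 0.03],
        "D": [0.6, 0.5, 0.2],
    }
)

print(df)
```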

Then:

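import numpy as np

df = df.set_index("emails/names")
# Take a writable float copy: the loop overwrites cells with inf, and
# to_numpy() may otherwise return a (possibly read-only) view of df's data.
numpy_df = df.to_numpy(dtype=float, copy=True)

# This version assumes there are at least as many names as emails.
forbidden_rows, forbidden_cols = [], []
while len(forbidden_rows) != len(df):
    # Position of the smallest remaining distance in the whole matrix
    row, col = np.unravel_index(numpy_df.argmin(), numpy_df.shape)
    # Exclude the chosen name (column) and email (row) from later rounds
    numpy_df[:, col] = np.inf
    numpy_df[row, :] = np.inf
    forbidden_rows.append(df.index[row])
    forbidden_cols.append(df.columns[col])

for r, c in zip(forbidden_rows, forbidden_cols):
    print(r, c)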

Prints:

3@ C
2@ A
1@ B

EDIT: Converted the dataframe to numpy.ndarray first.
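Since the loop hinges on `np.unravel_index`, here is a small sketch of what that call does on the question's data. The variable names `distances` and `flat_index` are only illustrative.

```python
import numpy as np

# Distance matrix from the question: rows are emails 1@..3@, columns are names A..D.
distances = np.array(
    [
        [0.20, 0.30, 0.40, 0.6],
        [0.15, 0.20, 0.20, 0.5],
        [0.10, 0.05, 0.03, 0.2],
    ]
)

# Without an axis argument, argmin() searches the flattened (row-major) array
# and returns a single flat position.
flat_index = distances.argmin()

# unravel_index converts that flat position back into (row, column) for the given shape.
row, col = np.unravel_index(flat_index, distances.shape)

print(flat_index, row, col)
```

Here the smallest value, 0.03, sits at flat position 10, which maps back to row 2 and column 2, i.e. email 3@ and name C.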

EDIT: To print unassigned emails:

For this dataframe:

  emails/names     A     B     C    D
0           1@  0.20  0.30  0.40  0.6
1           2@  0.15  0.20  0.20  0.5
2           3@  0.10  0.05  0.03  0.2
3           4@  0.10  0.05  0.03  0.2
4           5@  0.11  0.25  0.43  0.2
5           6@  0.12  0.35  0.53  0.3

This:

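import numpy as np

df = df.set_index("emails/names")
# Take a writable float copy: the loop overwrites cells with inf, and
# to_numpy() may otherwise return a (possibly read-only) view of df's data.
numpy_df = df.to_numpy(dtype=float, copy=True)

forbidden_rows, forbidden_cols = [], []
# Stop as soon as either every email or every name has been used
while len(forbidden_rows) != len(df) and len(forbidden_cols) != len(df.columns):
    # Position of the smallest remaining distance in the whole matrix
    row, col = np.unravel_index(numpy_df.argmin(), numpy_df.shape)
    # Exclude the chosen name (column) and email (row) from later rounds
    numpy_df[:, col] = np.inf
    numpy_df[row, :] = np.inf
    forbidden_rows.append(df.index[row])
    forbidden_cols.append(df.columns[col])

for r, c in zip(forbidden_rows, forbidden_cols):
    print(r, c)

# Emails that never received a name
print("Unassigned emails:")
print(df.index[~df.index.isin(forbidden_rows)].values)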

Prints:

3@ C
4@ B
5@ A
6@ D
Unassigned emails:
['1@' '2@']
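Since the question also allows plain dicts, the same greedy rule can be sketched without NumPy by sorting all (distance, email, name) triples once and taking each pair whose email and name are both still free. This is an alternative formulation, not the code from the answer: `greedy_assign` is a hypothetical helper name, and ties are broken by email and then name in sort order, which may differ from `argmin` when labels are not in sorted order.

```python
def greedy_assign(distances):
    """Greedily pair emails with names, smallest distance first.

    distances: {email: {name: distance}}
    Returns (assigned, unassigned), where assigned maps email -> name and
    unassigned lists the emails left without a name.
    """
    # Flatten to (distance, email, name) triples and sort ascending by distance.
    pairs = sorted(
        (dist, email, name)
        for email, row in distances.items()
        for name, dist in row.items()
    )

    assigned = {}
    used_names = set()
    for dist, email, name in pairs:
        # Skip pairs whose email or name has already been taken.
        if email in assigned or name in used_names:
            continue
        assigned[email] = name
        used_names.add(name)

    unassigned = [email for email in distances if email not in assigned]
    return assigned, unassigned


example = {
    "1@": {"A": 0.2, "B": 0.3, "C": 0.4, "D": 0.6},
    "2@": {"A": 0.15, "B": 0.2, "C": 0.2, "D": 0.5},
    "3@": {"A": 0.1, "B": 0.05, "C": 0.03, "D": 0.2},
}
assigned, unassigned = greedy_assign(example)
print(assigned, unassigned)
```

On the three-email example this yields the pairs 3@/C, 2@/A and 1@/B with nothing unassigned. Whether it is faster than the NumPy loop depends on the data size, so it is worth timing both on real inputs.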

Wow, cool solution! I still need some time to process it and understand how np.unravel_index works, but the solution seems correct. Thanks. — Alex_Y, Apr 11 '21 at 21:58
@Oleksii I "borrowed" this from: https://stackoverflow.com/questions/3230067/numpy-minimum-in-row-column-format — Andrej Kesely, Apr 11 '21 at 21:59
One more request: sometimes there may be more Emails than Names. How can I catch such "left unassigned" Emails? — Alex_Y, Apr 11 '21 at 22:07
