
For example, I have two lists of entities, Names and Emails, and a function that measures the distance between an Email and a Name. Below, for each Email I have measured the distance to each Name.

    1@ - {A:0.2, B:0.3, C:0.4, D:0.6}
    2@ - {A:0.15, B:0.2, C:0.2, D:0.5}
    3@ - {A:0.1, B:0.05, C:0.03, D:0.2}

Now I want to find a single minimum-distance Name for each Email. However, if two Emails have the same closest Name candidate, the Email with the smaller distance wins. The other Email should then select its second-closest Name candidate and be checked again.

So, in this case the result should be:

    1@: B  
    2@: A
    3@: C

Table to explain:

    emails/names     A     B     C    D
    1@            0.20  0.30  0.40  0.6
    2@            0.15  0.20  0.20  0.5
    3@            0.10  0.05  0.03  0.2

Speed is important. The data can be processed as a dataframe or as dicts; either is fine.
Thanks for any help.


UPD:

The number of Emails can be greater than the number of Names, in which case some Emails will remain unassigned. I also need to catch those.

Alex_Y

1 Answer


Supposing you have this dataframe:

  emails/names     A     B     C    D
0           1@  0.20  0.30  0.40  0.6
1           2@  0.15  0.20  0.20  0.5
2           3@  0.10  0.05  0.03  0.2

Then:

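import numpy as np

df = df.set_index("emails/names")
# Work on a writable float copy so that df itself is left untouched
numpy_df = df.to_numpy(dtype=float, copy=True)

# Assumes there are at least as many names as emails
forbidden_rows, forbidden_cols = [], []
while len(forbidden_rows) != len(df):
    # Position of the smallest remaining distance, as (row, column)
    row, col = np.unravel_index(numpy_df.argmin(), numpy_df.shape)
    # Mask the chosen name and email so neither can be picked again
    numpy_df[:, col] = np.inf
    numpy_df[row, :] = np.inf
    forbidden_rows.append(df.index[row])
    forbidden_cols.append(df.columns[col])

for r, c in zip(forbidden_rows, forbidden_cols):
    print(r, c)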

Prints:

3@ C
2@ A
1@ B

EDIT: Converted the dataframe to numpy.ndarray first.
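
In case the `argmin` / `np.unravel_index` step is unclear, here is a minimal sketch on the same numbers. The array name `dist` is only for illustration. `argmin()` without an axis returns a position in the flattened array, and `np.unravel_index` converts that position back to a (row, column) pair:

```python
import numpy as np

# Distances from the example: rows are emails 1@..3@, columns are names A..D.
dist = np.array([
    [0.20, 0.30, 0.40, 0.6],
    [0.15, 0.20, 0.20, 0.5],
    [0.10, 0.05, 0.03, 0.2],
])

flat = dist.argmin()                            # index into the flattened array
row, col = np.unravel_index(flat, dist.shape)   # back to (row, column)

print(flat, row, col)  # 10 2 2, i.e. email 3@ (row 2) and name C (column 2)
```

The smallest value, 0.03, sits at row 2, column 2, which is flat position 2 * 4 + 2 = 10.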


EDIT: To print unassigned emails:

For this dataframe:

  emails/names     A     B     C    D
0           1@  0.20  0.30  0.40  0.6
1           2@  0.15  0.20  0.20  0.5
2           3@  0.10  0.05  0.03  0.2
3           4@  0.10  0.05  0.03  0.2
4           5@  0.11  0.25  0.43  0.2
5           6@  0.12  0.35  0.53  0.3

This:

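import numpy as np

df = df.set_index("emails/names")
# Work on a writable float copy so that df itself is left untouched
numpy_df = df.to_numpy(dtype=float, copy=True)

forbidden_rows, forbidden_cols = [], []
# Stop as soon as either all emails or all names have been used
while len(forbidden_rows) != len(df) and len(forbidden_cols) != len(df.columns):
    # Position of the smallest remaining distance, as (row, column)
    row, col = np.unravel_index(numpy_df.argmin(), numpy_df.shape)
    # Mask the chosen name and email so neither can be picked again
    numpy_df[:, col] = np.inf
    numpy_df[row, :] = np.inf
    forbidden_rows.append(df.index[row])
    forbidden_cols.append(df.columns[col])

for r, c in zip(forbidden_rows, forbidden_cols):
    print(r, c)

print("Unassigned emails:")
print(df.index[~df.index.isin(forbidden_rows)].values)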

Prints:

3@ C
4@ B
5@ A
6@ D
Unassigned emails:
['1@' '2@']
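
Since the question mentions that dicts are fine too, the same greedy idea can be sketched without pandas or NumPy. This is only a sketch: the function name `greedy_assign` and its return shape are made up for illustration. It sorts all (distance, email, name) triples once and accepts a pair whenever both the email and the name are still free. Ties are broken by email and then by name, which may differ from the tie-breaking of the array version.

```python
def greedy_assign(distances):
    """Greedily pair each email with a distinct name, smallest distance first.

    `distances` maps email -> {name: distance}.
    Returns (pairs, unassigned): a dict email -> name and a list of emails
    that could not be paired because all names were taken.
    """
    # Flatten to (distance, email, name) triples and sort them once.
    triples = sorted(
        (d, email, name)
        for email, row in distances.items()
        for name, d in row.items()
    )
    pairs, used_names = {}, set()
    for d, email, name in triples:
        # Accept the pair only if neither side has been used yet.
        if email not in pairs and name not in used_names:
            pairs[email] = name
            used_names.add(name)
    unassigned = [email for email in distances if email not in pairs]
    return pairs, unassigned


distances = {
    "1@": {"A": 0.2, "B": 0.3, "C": 0.4, "D": 0.6},
    "2@": {"A": 0.15, "B": 0.2, "C": 0.2, "D": 0.5},
    "3@": {"A": 0.1, "B": 0.05, "C": 0.03, "D": 0.2},
}
pairs, unassigned = greedy_assign(distances)
print(pairs)       # {'3@': 'C', '2@': 'A', '1@': 'B'}
print(unassigned)  # []
```

When there are more emails than names, the leftover emails end up in `unassigned`.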
Andrej Kesely
  • Wow, cool solution! I still need some time to process it and understand how np.unravel_index works, but the solution seems correct. Thanks. – Alex_Y Apr 11 '21 at 21:58
  • @Oleksii I "borrowed" this from: https://stackoverflow.com/questions/3230067/numpy-minimum-in-row-column-format – Andrej Kesely Apr 11 '21 at 21:59
  • One more request: sometimes there can be more Emails than Names. How can I catch such "left unassigned" Emails? – Alex_Y Apr 11 '21 at 22:07