Creating a unique id based on combinations of columns (ignoring order)

Question

Scenario

Imagine a dataset that is the result of some process. The dataset contains two id columns, id1 and id2, that represent where parts of the data came from in preceding process. The two id columns can be int or string. I want to create a unique id based on the ids present in both of these id columns. However, the ordering of ids is not important. For example:

id1 == A and id2 == NaN is the same as id1 == NaN and id2==A because the only 'real' id present is A. The id value could be a single letter, number, or string(s).

Test data

>>> import pandas as pd
>>> import numpy as np
>>> df = pd.DataFrame({'id1': ['A', np.nan, 'A', 'B'], 'id2': [np.nan, 'B', 'B', 'A']})
>>> df
   id1  id2
0    A  NaN
1  NaN    B
2    B    B
3    B    A

Core question

How can I use pandas and numpy functions to create a single unique id based on the combinations of id1 and id2 when ordering is not important?

Things I've tried

Coercing id1 and id2 to strings, combining, sorting, and dropping duplicates (works but is messy)
pd.factorize: seems better than above but still requires a combination and sorting of id1 and id2

Desired output

>>> df
   id1  id2  combined_id
0    A  NaN  A
1  NaN    B  B
2    A    B  AB
3    B    A  AB
4  NaN  NaN  NaN

What about `B A` will that be `AB` also since ordering is not important or will it be `BA`? — Onyambu, Feb 24 '23 at 19:35
@onyambu thanks for your response! Yes, `B A` should also become `AB`. — whatwhatWHAT, Feb 26 '23 at 15:17
@whatwhatWHAT is it okay to create a numerical id for the combination of both columns? for example for the givwn sample the resulting id column would be (0, 1, 2, 2, 3) — Shubham Sharma, Feb 26 '23 at 16:08
@ShubhamSharma yes, for sure! I think `pd.factorize` accomplishes this, in a sense. — whatwhatWHAT, Feb 27 '23 at 21:25

score 0 · Answer 1 · answered Feb 24 '23 at 19:46

0

Added a row B A to your data

(df.reset_index().melt('index').dropna().sort_values('value')
   .groupby('index').agg({'value':''.join}))
 
      value
index      
0         A
1         B
2        AB
3        AB

answered Feb 24 '23 at 19:46

Onyambu

67,392
3
24
53

Will this work for strings? E.g,. what if the `id`s are `Red Blue`, which should be the same as `Blue Red`? – whatwhatWHAT Feb 26 '23 at 15:18
@whatwhatWHAT Yes it will work for strings. Try it out – Onyambu Feb 26 '23 at 18:15

score 0 · Answer 2 · answered Feb 24 '23 at 20:23

0

The simplest way is:

df['combinedid'] = df['id1']+ df['id2']

Result:

	id1	id2	combinedid
0	A
1		B
2	A	B	AB

After you remove the NaN values:

df = df.fillna("")
df['combinedid'] = df['id1']+ df['id2']

Result:

	id1	id2	combinedid
0	A		A
1		B	B
2	A	B	AB

answered Feb 24 '23 at 20:23

Lars K.

59
4

I updated the question to show why those solutions would not work (e.g., in the case of `B A` which is equivalent to `A B` – whatwhatWHAT Feb 26 '23 at 15:17