1

Scenario

Imagine a dataset that is the result of some process. The dataset contains two id columns, id1 and id2, that represent where parts of the data came from in preceding process. The two id columns can be int or string. I want to create a unique id based on the ids present in both of these id columns. However, the ordering of ids is not important. For example:

id1 == A and id2 == NaN is the same as id1 == NaN and id2==A because the only 'real' id present is A. The id value could be a single letter, number, or string(s).

Test data

>>> import pandas as pd
>>> import numpy as np
>>> df = pd.DataFrame({'id1': ['A', np.nan, 'A', 'B'], 'id2': [np.nan, 'B', 'B', 'A']})
>>> df
   id1  id2
0    A  NaN
1  NaN    B
2    B    B
3    B    A

Core question

How can I use pandas and numpy functions to create a single unique id based on the combinations of id1 and id2 when ordering is not important?

Other relevant answers

In Pandas, how to create a unique ID based on the combination of many columns?

Grouping by multiple columns to find duplicate rows pandas

Things I've tried

  • Coercing id1 and id2 to strings, combining, sorting, and dropping duplicates (works but is messy)
  • pd.factorize: seems better than above but still requires a combination and sorting of id1 and id2

Desired output

>>> df
   id1  id2  combined_id
0    A  NaN  A
1  NaN    B  B
2    A    B  AB
3    B    A  AB
4  NaN  NaN  NaN
whatwhatWHAT
  • 123
  • 4

2 Answers2

0

Added a row B A to your data

(df.reset_index().melt('index').dropna().sort_values('value')
   .groupby('index').agg({'value':''.join}))
 
      value
index      
0         A
1         B
2        AB
3        AB
Onyambu
  • 67,392
  • 3
  • 24
  • 53
0

The simplest way is:

df['combinedid'] = df['id1']+ df['id2']

Result:

id1 id2 combinedid
0 A
1 B
2 A B AB

After you remove the NaN values:

df = df.fillna("")
df['combinedid'] = df['id1']+ df['id2']

Result:

id1 id2 combinedid
0 A A
1 B B
2 A B AB
Lars K.
  • 59
  • 4
  • I updated the question to show why those solutions would not work (e.g., in the case of `B A` which is equivalent to `A B` – whatwhatWHAT Feb 26 '23 at 15:17