Scenario
Imagine a dataset that is the result of some process. The dataset contains two id columns, id1
and id2
, that represent where parts of the data came from in preceding process. The two id columns can be int or string. I want to create a unique id based on the ids present in both of these id columns. However, the ordering of ids is not important. For example:
id1 == A
and id2 == NaN
is the same as id1 == NaN
and id2==A
because the only 'real' id present is A
. The id value could be a single letter, number, or string(s).
Test data
>>> import pandas as pd
>>> import numpy as np
>>> df = pd.DataFrame({'id1': ['A', np.nan, 'A', 'B'], 'id2': [np.nan, 'B', 'B', 'A']})
>>> df
id1 id2
0 A NaN
1 NaN B
2 B B
3 B A
Core question
How can I use pandas and numpy functions to create a single unique id based on the combinations of id1 and id2 when ordering is not important?
Other relevant answers
In Pandas, how to create a unique ID based on the combination of many columns?
Grouping by multiple columns to find duplicate rows pandas
Things I've tried
- Coercing id1 and id2 to strings, combining, sorting, and dropping duplicates (works but is messy)
pd.factorize
: seems better than above but still requires a combination and sorting of id1 and id2
Desired output
>>> df
id1 id2 combined_id
0 A NaN A
1 NaN B B
2 A B AB
3 B A AB
4 NaN NaN NaN