I can't find an elegant way to select rows whose values are unique in column A and in column B separately, not jointly and not necessarily in sequence. The point is to keep an "inclusive" intersection of the unique values from these two columns: my aim is to retain as many unique values as possible across columns A and B. The columns are considered together, but I am looking for all the unique "combinations" of their values...
Sample dataframe:
import pandas as pd

df1 = pd.DataFrame({"A": ["A1", "A2", "A2", "A3", "A3"],
                    "B": ["B1", "B1", "B2", "B3", "B1"]},
                   index=[0, 1, 2, 3, 4])
Result:
A B
0 A1 B1
1 A2 B1
2 A2 B2
3 A3 B3
4 A3 B1
Dropping joint duplicates does nothing useful here, since every (A, B) pair is already unique:
df2 = df1.drop_duplicates(subset=["A", "B"], keep="first")
Result:
A B
0 A1 B1
1 A2 B1
2 A2 B2
3 A3 B3
4 A3 B1
The code below leaves a duplicated B1, which can be removed later using drop_duplicates on column B; but then A2 will also be removed, even though it could have been kept had the row pairing it with B2 (at index=2 of the original dataframe) been chosen instead.
df3 = df1.drop_duplicates(subset=["A"], keep="first")
Result:
A B
0 A1 B1
1 A2 B1
3 A3 B3
As mentioned above, A2 is removed, although it could have been kept had the row pairing it with B2 (at index=2 of the original dataframe) been chosen instead.
df4 = df3.drop_duplicates(subset=["B"], keep="first")
Result:
A B
0 A1 B1
3 A3 B3
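For completeness, I also tried keep="last" when deduplicating on A. That does retain the A2/B2 pairing, but then A3 is dropped instead, so no fixed keep strategy seems to give the desired result:

```python
import pandas as pd

df1 = pd.DataFrame({"A": ["A1", "A2", "A2", "A3", "A3"],
                    "B": ["B1", "B1", "B2", "B3", "B1"]},
                   index=[0, 1, 2, 3, 4])

# Keep the *last* occurrence of each A value, so A2 is paired with B2
df3b = df1.drop_duplicates(subset=["A"], keep="last")
# df3b: index 0 -> A1/B1, index 2 -> A2/B2, index 4 -> A3/B1

# Now deduplicate on B: index 4 (the duplicated B1) is dropped,
# and A3 disappears from the result entirely
df4b = df3b.drop_duplicates(subset=["B"], keep="first")
print(df4b)
#     A   B
# 0  A1  B1
# 2  A2  B2
```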
Desired result:
A B
0 A1 B1
1 A2 B2
2 A3 B3
So, to restate: my aim is to keep as many unique values as possible across columns A and B, considering the columns together while looking for unique "combinations" of their values.
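For reference, a greedy single pass that keeps a row only when both its A value and its B value are still unseen does produce the desired result on this sample. I suspect it is not guaranteed to be optimal in general (it looks like a bipartite matching problem), so treat it as a sketch rather than a solution:

```python
import pandas as pd

df1 = pd.DataFrame({"A": ["A1", "A2", "A2", "A3", "A3"],
                    "B": ["B1", "B1", "B2", "B3", "B1"]},
                   index=[0, 1, 2, 3, 4])

seen_a, seen_b = set(), set()
keep = []
for idx, row in df1.iterrows():
    # Keep the row only if both its A and B values are new so far
    if row["A"] not in seen_a and row["B"] not in seen_b:
        seen_a.add(row["A"])
        seen_b.add(row["B"])
        keep.append(idx)

result = df1.loc[keep]
print(result)
#     A   B
# 0  A1  B1
# 2  A2  B2
# 3  A3  B3
```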