I can't find an elegant way to select rows whose values are unique within column A and within column B separately, rather than jointly or by deduplicating one column after the other. The goal is to keep an "inclusive" intersection of the unique values from these two columns.

My aim is to keep as many unique values as possible across columns A and B. The columns are considered jointly, but I am looking for all the unique "combinations" of their values.

Sample dataframe:

import pandas as pd

df1 = pd.DataFrame({"A": ["A1", "A2", "A2", "A3", "A3"],
                    "B": ["B1", "B1", "B2", "B3", "B1"]},
                   index=[0, 1, 2, 3, 4])

Result:

    A   B
0  A1  B1
1  A2  B1
2  A2  B2
3  A3  B3
4  A3  B1

This does nothing useful, because every (A, B) pair in df1 is already unique, so no rows are dropped:

df2 = df1.drop_duplicates(subset=["A", "B"], keep="first", inplace=False)

Result:

    A   B
0  A1  B1
1  A2  B1
2  A2  B2
3  A3  B3
4  A3  B1

The code below leaves a duplicated B1, which could be removed later using drop_duplicates on column B, but then A2 would also be removed, even though it could have been kept had it been paired with B2, as in the row at index 2 of the original dataframe.

df3 = df1.drop_duplicates(subset=["A"], keep="first", inplace=False)

Result:

    A   B
0  A1  B1
1  A2  B1
3  A3  B3

As mentioned above, A2 is removed, even though it could have been kept by pairing it with B2, as in the row at index 2 of the original dataframe.

df4 = df3.drop_duplicates(subset=["B"], keep="first", inplace=False)

Result:

    A   B
0  A1  B1
3  A3  B3
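
Incidentally, this sequential approach is order-dependent: deduplicating on B first and then on A happens to reproduce the desired result on this particular sample, although nothing guarantees that for other data:

df5 = df1.drop_duplicates(subset=["B"], keep="first").drop_duplicates(subset=["A"], keep="first")

Result:

    A   B
0  A1  B1
2  A2  B2
3  A3  B3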

Desired result:

    A   B
0  A1  B1
1  A2  B2
2  A3  B3

  • And what happens if you add one more row, ['A3', 'B4'], to your original DataFrame? Should this row also appear in your output? Right now, I can't tell whether you are trying to get the maximum number of rows with entirely unique values, or the minimum number of rows that capture all unique values. – ALollz Mar 15 '19 at 15:32
  • I don't mind if row `['A3', 'B4']` happens to appear in the output as I can filter it out in the next step by removing duplicates in the `A` column. What I am after is to get **all** the values from the `A` column but only once (so this column contains only unique values) but paired with unique values from the `B` column. – Alex Mar 15 '19 at 17:24
  • Now if I use `subset=[ "A", "B"]` then the output contains rows unique in both. If I drop duplicates from the `A` column first and from the `B` column afterwards many unique values will be lost. I would need to remove the duplicates from `A` column but in a way that saves as many values in the `B` column as possible and the other way around... – Alex Mar 15 '19 at 17:28
  • possibly something with the use of `networkx.algorithms.matching.max_weight_matching`? (see the sketch after these comments) – Alex Mar 15 '19 at 17:57
  • @Alex, you need the unique values of both of the columns, right? Then why do you think the code below won't work for it? – Ghanshyam Savaliya Mar 16 '19 at 11:48
  • Thank you very much! This works for the problem that I have outlined above. – Alex Mar 20 '19 at 13:55
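
Following up on the `max_weight_matching` idea from the comments: the rows can be treated as the edges of a bipartite graph between the values of column A and the values of column B, and a maximum-cardinality matching then pairs each A value with at most one B value while keeping as many values as possible. Below is a minimal sketch of that idea; the `a_`/`b_` node prefixes are illustrative only, added so that a label appearing in both columns cannot collapse into a single node.

import pandas as pd
import networkx as nx

df1 = pd.DataFrame({"A": ["A1", "A2", "A2", "A3", "A3"],
                    "B": ["B1", "B1", "B2", "B3", "B1"]})

# One edge per row; the prefixes keep the two sides of the graph apart.
G = nx.Graph()
G.add_edges_from(("a_" + a, "b_" + b) for a, b in zip(df1["A"], df1["B"]))

# Maximum-cardinality matching: each node appears in at most one pair.
matching = nx.algorithms.matching.max_weight_matching(G, maxcardinality=True)

# Matched edges come back as unordered pairs; "a_..." sorts before "b_...".
pairs = sorted((min(u, v)[2:], max(u, v)[2:]) for u, v in matching)
result = pd.DataFrame(pairs, columns=["A", "B"])
print(result)
#     A   B
# 0  A1  B1
# 1  A2  B2
# 2  A3  B3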

1 Answer

Try the code below:

df1.drop_duplicates(subset=["A" and "B"], keep="first", inplace=False)

Output:

    A   B
0  A1  B1
2  A2  B2
3  A3  B3
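
Why this happens to work: in Python, `"A" and "B"` evaluates to `"B"` (`and` returns its second operand when the first is truthy), so the call above is equivalent to drop_duplicates(subset=["B"]):

print(["A" and "B"])  # ['B']

Keeping the first row per unique B value reproduces the desired result on this sample, but it is not a general solution: an A value that only ever co-occurs with already-seen B values would be dropped entirely. The bipartite-matching sketch after the comments handles that general case.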