I can't find an elegant way to select rows whose values are unique within column A and within column B separately, rather than jointly or by deduplicating one column after the other. The goal is to keep an "inclusive" intersection of the unique values from these two columns.

My aim is to keep as many unique values as possible across columns A and B. The columns are considered jointly, but I am looking for all the unique "combinations" of their values.

Sample dataframe:

import pandas as pd

df1 = pd.DataFrame({"A": ["A1", "A2", "A2", "A3", "A3"],
                    "B": ["B1", "B1", "B2", "B3", "B1"]},
                   index=[0, 1, 2, 3, 4])

Result:

    A   B
0  A1  B1
1  A2  B1
2  A2  B2
3  A3  B3
4  A3  B1

This does nothing useful, because every (A, B) pair in df1 is already unique, so no rows are dropped:

df2 = df1.drop_duplicates(subset=["A", "B"], keep="first", inplace=False)

Result:

    A   B
0  A1  B1
1  A2  B1
2  A2  B2
3  A3  B3
4  A3  B1

The code below leaves a duplicated B1, which could be removed later using drop_duplicates on column B, but then A2 would also be removed, even though it could have been kept had it been paired with B2, as in the row at index 2 of the original dataframe.

df3 = df1.drop_duplicates(subset=["A"], keep="first", inplace=False)

Result:

    A   B
0  A1  B1
1  A2  B1
3  A3  B3

As mentioned above, A2 is removed, even though it could have been kept by pairing it with B2, as in the row at index 2 of the original dataframe.

df4 = df3.drop_duplicates(subset=["B"], keep="first", inplace=False)

Result:

    A   B
0  A1  B1
3  A3  B3
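
Incidentally, this sequential approach is order-dependent: deduplicating on B first and then on A happens to reproduce the desired result on this particular sample, although nothing guarantees that for other data:

df5 = df1.drop_duplicates(subset=["B"], keep="first").drop_duplicates(subset=["A"], keep="first")

Result:

    A   B
0  A1  B1
2  A2  B2
3  A3  B3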

Desired result:

    A   B
0  A1  B1
1  A2  B2
2  A3  B3

  • And what happens if you add one more row, ['A3', 'B4'], to your original DataFrame? Should this row also appear in your output? Right now, I can't tell whether you are trying to get the maximum number of rows with entirely unique values, or the minimum number of rows that capture all unique values. – ALollz Mar 15 '19 at 15:32
  • I don't mind if row `['A3', 'B4']` happens to appear in the output as I can filter it out in the next step by removing duplicates in the `A` column. What I am after is to get **all** the values from the `A` column but only once (so this column contains only unique values) but paired with unique values from the `B` column. – Alex Mar 15 '19 at 17:24
  • Now if I use `subset=[ "A", "B"]` then the output contains rows unique in both. If I drop duplicates from the `A` column first and from the `B` column afterwards many unique values will be lost. I would need to remove the duplicates from `A` column but in a way that saves as many values in the `B` column as possible and the other way around... – Alex Mar 15 '19 at 17:28
  • possibly something with the use of `networkx.algorithms.matching.max_weight_matching`? (see the sketch after these comments) – Alex Mar 15 '19 at 17:57
  • @Alex, you need the unique values of both of the columns, right? Then why do you think the code below won't work for it? – Ghanshyam Savaliya Mar 16 '19 at 11:48
  • Thank you very much! This works for the problem that I have outlined above. – Alex Mar 20 '19 at 13:55
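
Following up on the `max_weight_matching` idea from the comments: the rows can be treated as the edges of a bipartite graph between the values of column A and the values of column B, and a maximum-cardinality matching then pairs each A value with at most one B value while keeping as many values as possible. Below is a minimal sketch of that idea; the `a_`/`b_` node prefixes are illustrative only, added so that a label appearing in both columns cannot collapse into a single node.

import pandas as pd
import networkx as nx

df1 = pd.DataFrame({"A": ["A1", "A2", "A2", "A3", "A3"],
                    "B": ["B1", "B1", "B2", "B3", "B1"]})

# One edge per row; the prefixes keep the two sides of the graph apart.
G = nx.Graph()
G.add_edges_from(("a_" + a, "b_" + b) for a, b in zip(df1["A"], df1["B"]))

# Maximum-cardinality matching: each node appears in at most one pair.
matching = nx.algorithms.matching.max_weight_matching(G, maxcardinality=True)

# Matched edges come back as unordered pairs; "a_..." sorts before "b_...".
pairs = sorted((min(u, v)[2:], max(u, v)[2:]) for u, v in matching)
result = pd.DataFrame(pairs, columns=["A", "B"])
print(result)
#     A   B
# 0  A1  B1
# 1  A2  B2
# 2  A3  B3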

1 Answer

Try the code below:

df1.drop_duplicates(subset=["A" and "B"], keep="first", inplace=False)

Output:

    A   B
0  A1  B1
2  A2  B2
3  A3  B3
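
Why this happens to work: in Python, `"A" and "B"` evaluates to `"B"` (`and` returns its second operand when the first is truthy), so the call above is equivalent to drop_duplicates(subset=["B"]):

print(["A" and "B"])  # ['B']

Keeping the first row per unique B value reproduces the desired result on this sample, but it is not a general solution: an A value that only ever co-occurs with already-seen B values would be dropped entirely. The bipartite-matching sketch after the comments handles that general case.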