1

I have a dataset with 3 target classes: ‘Yes’, ‘Maybe’, and ‘No’.

Unique_id       target
111              Yes
111             Maybe
111              No
112              No
112             Maybe
113              No

I want to drop duplicate rows based on unique_id. But ‘drop duplicates’ generally keeps the first or last row, and I want to keep the rows based on following the conditions:

1) If unique_id has all the 3 classes (Yes, Maybe and No), we’ll keep only the ‘Yes’ class.
2) If unique_id has the 2 classes (Maybe and No), we’ll keep only the ‘Maybe’ class.
3) We’ll keep the ‘No’ class when only ‘No’ will be there.

I tried ‘sort_values’ the target class (Yes=1, Maybe=2, No=3) and then dropped the duplicates.

Desired output:

Unique_id       target
111               Yes
112              Maybe
113               No

I’m thinking if there are better ways to do that.

Any suggestions would be appreciated. Thanks!

Ch3steR
  • 20,090
  • 4
  • 28
  • 58
Roy
  • 924
  • 1
  • 6
  • 17
  • 2
    if I well understood you used the integer value 1,2,3 instead of the string, sorted and drop. I think it is ok. A very similar solution would be to use [categorical data](https://pandas.pydata.org/pandas-docs/stable/user_guide/categorical.html) using `ordered=True`. You can also do a groupby and take the minimum (between 1,2,3) – Ruggero Turra Sep 07 '21 at 15:49
  • All rows that within ID have that target or just any one? – ALollz Sep 07 '21 at 15:52
  • See if you require any further clarification from the answers below.  If no more question,  let us know any of the answers best fit for your needs by accepting the answer you choose.  Thanks! – SeaBean Sep 10 '21 at 05:59

2 Answers2

3

You can set column target as a Categorical data type by pd.CategoricalDtype with the ordering of ['Yes' < 'Maybe' < 'No'], as follows:

t = pd.CategoricalDtype(categories=['Yes', 'Maybe', 'No'], ordered=True)
df['target'] = df['target'].astype(t)

Then, you group by Unique_id using .groupby() and take the min on target within the group of same Unique_id using .GroupBy.min():

df.groupby('Unique_id', as_index=False)['target'].min()

Result:

   Unique_id target
0        111    Yes
1        112  Maybe
2        113     No

Edit

Case 1: If you have 2 or more similar columns (e.g. target and target2) to sort in the same ordering, you need just apply the codes to the 2 columns. For example, if we have the following dataframe:

   Unique_id target target2
0        111    Yes      No
1        111  Maybe   Maybe
2        111     No     Yes
3        112     No      No
4        112  Maybe   Maybe
5        113     No   Maybe

You can get the min of the 2 columns simultaneously, as follows:

t = pd.CategoricalDtype(categories=['Yes', 'Maybe', 'No'], ordered=True)
df[['target', 'target2']] = df[['target', 'target2']].astype(t)

df.groupby('Unique_id', as_index=False)[['target', 'target2']].min()

Result:

   Unique_id target target2
0        111    Yes     Yes
1        112  Maybe   Maybe
2        113     No   Maybe

Case 2: If you want to display all columns in the dataframe instead of just Unique_id and target columns, you can use an even simpler syntax, as follows:

Another dataframe example:

   Unique_id target  Amount
0        111    Yes     123
1        111  Maybe     456
2        111     No     789
3        112     No    1234
4        112  Maybe    5678
5        113     No      25

Then, to show all columns with target with min values for an Unique_id, you can use:

t = pd.CategoricalDtype(categories=['Yes', 'Maybe', 'No'], ordered=True)
df['target'] = df['target'].astype(t)

df.loc[df.groupby('Unique_id')['target'].idxmin()]

Result:

   Unique_id target  Amount
0        111    Yes     123
4        112  Maybe    5678
5        113     No      25
SeaBean
  • 22,547
  • 3
  • 13
  • 25
  • Hi @SeaBean, thank you. It works. How about if I have many columns? – Roy Sep 14 '21 at 15:17
  • @Roy You mean which of the following cases: 1) you have many columns but only one column needs to be sorted in this order, or 2) you have 2 or more columns need to be sorted in this order ? If it is case 2) this solution works better than the other solution. we can set type on the 2 columns and also groupby the 2 columns. If it is case 1) we can also make use of idxmin to show all columns with the target column sorted. In summary, using Categorical data type allows you to show 2 columns e.g `target` and `target2` with similar sorting sequence and at the same time can also sort one. – SeaBean Sep 14 '21 at 15:55
  • @Roy See my edit. You will then understand the advantages of using Categorical data type here. If you just want to show all columns, we can use an even simpler codes `df.loc[df.groupby('Unique_id')['target'].idxmin()]` to achieve it. If you want the 2 similar columns to get only the min in the group, we can readily use the codes for 2 columns. Take a look. – SeaBean Sep 14 '21 at 16:16
  • 1
    Wow, that's a great way of doing it, @SeaBean. Thank you so much :) – Roy Sep 15 '21 at 15:25
1

Using map and idxmin:

t = {'Yes':0, 'Maybe':1, 'No':2}
df.loc[df.assign(tar=df.target.map(t)).groupby('Unique_id')['tar'].idxmin()]
Pygirl
  • 12,969
  • 5
  • 30
  • 43