Python Pandas: How to choose a certain option within duplicates

Question

My data (df) looks like this:

Date	Name	Plan
2022	John	College
2022	John	Work
2021	Kel	College
2022	James	Work
2019	Daron	College
2019	JQ	NaN
2020	Mel	College
2017	Shama	Work
2021	John	NaN
2020	John	Work
2021	Mel	Work
2018	Shama	Work

My end result needs one plan (the most recent one) per name.

Currently I drop all rows where Plan is NaN, sort by service date, and keep only the most recent row per name using this code:

df = df.dropna(subset=['Plan'])
df = df.sort_values('Date').drop_duplicates('Name', keep='last')

This mostly works, but I need 'College' to take precedence over 'Work' when both appear for the same name on the same date. In the data above, the row (2022, John, Work) would be the one kept after dropping duplicates, not the one with 'College'.

Everything works, except this little part where the dates are duplicated AND there are two differing plans.
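
For reference, a minimal reproduction of the sample data and the current approach. The `kind="stable"` argument is an assumption added here so that rows with equal dates keep their original relative order; the code in the question does not set it. Note that `dropna` also removes JQ, whose only row has no plan.

```python
import numpy as np
import pandas as pd

# Sample data copied from the table in the question; missing plans are NaN.
df = pd.DataFrame({
    "Date": [2022, 2022, 2021, 2022, 2019, 2019,
             2020, 2017, 2021, 2020, 2021, 2018],
    "Name": ["John", "John", "Kel", "James", "Daron", "JQ",
             "Mel", "Shama", "John", "John", "Mel", "Shama"],
    "Plan": ["College", "Work", "College", "Work", "College", np.nan,
             "College", "Work", np.nan, "Work", "Work", "Work"],
})

# Current approach: drop missing plans, sort by date, keep the last row per name.
# kind="stable" (an addition) makes the order of same-date rows deterministic.
current = (
    df.dropna(subset=["Plan"])
      .sort_values("Date", kind="stable")
      .drop_duplicates("Name", keep="last")
)

# John's two 2022 rows tie on Date, so the later one in the input survives.
print(current[current["Name"] == "John"])
```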

Outside of pandas, I would think of it like this:

if service dates are duplicated AND one == college AND other == anything else: then keep the one with college

The end result I need:

Date	Name	Plan
2022	John	College
2021	Kel	College
2022	James	Work
2019	Daron	College
2019	JQ	NaN
2021	Mel	Work
2018	Shama	Work

Let me know if that makes sense. Thank you!

Could you explain the thought behind this? It didn't work – Matthew Rozanoff Jul 20 '22 at 14:19

score 1 · Accepted Answer · answered Jul 20 '22 at 14:15

You can use a custom sort for "Plan" that gives priority to "College" over "Work". Here I take advantage of an ordered Categorical, but you could also go with a mapping from a dictionary:

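import pandas as pd

(df
 # ordered categorical: 'Work' < 'College', so 'College' sorts last within a date
 .assign(cat=pd.Categorical(df['Plan'], categories=['Work', 'College'],
                            ordered=True))
 .sort_values(by=['Date', 'cat'], na_position='first')
 .drop(columns='cat')
 # last() takes the last non-null value of each column per Name
 .groupby('Name', as_index=False).last()
)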

output:

    Name  Date     Plan
0  Daron  2019  College
1     JQ  2019     None
2  James  2022     Work
3   John  2022  College
4    Kel  2021  College
5    Mel  2021     Work
6  Shama  2018     Work
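
The dictionary mapping mentioned above could look roughly like the sketch below. It is not the code from this answer: the `priority` dict and the helper columns `has_plan` and `prio` are illustrative names, and it uses `drop_duplicates` instead of `groupby(...).last()` so that each result row is an actual row of the input. Plans missing from the dict are assumed to rank below the listed ones.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "Date": [2022, 2022, 2021, 2022, 2019, 2019,
             2020, 2017, 2021, 2020, 2021, 2018],
    "Name": ["John", "John", "Kel", "James", "Daron", "JQ",
             "Mel", "Shama", "John", "John", "Mel", "Shama"],
    "Plan": ["College", "Work", "College", "Work", "College", np.nan,
             "College", "Work", np.nan, "Work", "Work", "Work"],
})

# Hypothetical priority table: the higher number wins a tie on the same date.
priority = {"Work": 1, "College": 2}

result = (
    df.assign(
        has_plan=df["Plan"].notna(),              # rows with a plan beat rows without
        prio=df["Plan"].map(priority).fillna(0),  # plans not in the dict get 0
    )
    # Ascending sort: the best row per name ends up last.
    .sort_values(["has_plan", "Date", "prio"])
    .drop_duplicates("Name", keep="last")
    .drop(columns=["has_plan", "prio"])
    .sort_values("Name")
    .reset_index(drop=True)
)

print(result)
```

Sorting on `has_plan` first means a name only keeps a NaN plan when it has no other row, which is how JQ stays in the result.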

score 0 · Answer 2 · answered Jul 20 '22 at 14:18


Let us sort the values, then drop the duplicates in Name:

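# 'College' < 'Work', so 'College' sorts first within each date
df['Plan'] = pd.Categorical(df['Plan'], categories=['College', 'Work'], ordered=True)
# newest date first, then preferred plan; keep the first row per Name
df.sort_values(['Date', 'Plan'], ascending=[False, True]).drop_duplicates('Name')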

    Date   Name     Plan
0   2022   John  College
3   2022  James     Work
2   2021    Kel  College
10  2021    Mel     Work
4   2019  Daron  College
5   2019     JQ      NaN
11  2018  Shama     Work

Shubham Sharma


'College' and 'Other' worked, but nothing else showed up, and I got a slicer error. Thank you for your input! – Matthew Rozanoff Jul 20 '22 at 14:43
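
Part of the symptom in the comment above (other plans not showing up) may come from how `pd.Categorical` treats values that are not listed in `categories`: they are converted to NaN. A small sketch of that behaviour and one possible workaround follows; "Military" is a made-up plan value used only for illustration, and the slicer error is not addressed here.

```python
import pandas as pd

# "Military" is a hypothetical extra plan that is not in the category list.
plans = pd.Series(["College", "Work", "Military", None])

# Values outside `categories` silently become NaN.
narrow = pd.Categorical(plans, categories=["College", "Work"], ordered=True)
print(narrow)

# Possible workaround: list the preferred plans first, then append
# every other plan that actually occurs in the data.
preferred = ["College", "Work"]
others = [p for p in plans.dropna().unique() if p not in preferred]
wide = pd.Categorical(plans, categories=preferred + others, ordered=True)
print(wide)
```

With the wider category list, the extra plans keep their values and simply sort after the preferred ones.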
