My data (df) looks like this:
Date | Name | Plan |
---|---|---|
2022 | John | College |
2022 | John | Work |
2021 | Kel | College |
2022 | James | Work |
2019 | Daron | College |
2019 | JQ | NaN |
2020 | Mel | College |
2017 | Shama | Work |
2021 | John | Nan |
2020 | John | Work |
2021 | Mel | Work |
2018 | Shama | Work |
My end result needs one plan (the most recent one), per one name.
Currently I: Drop all Plan NaN values, then sort by service date, and drop all but the most recent date using this code:
df = df.dropna(subset=['Plan'])
df = df.sort_values('Date').drop_duplicates('Name', keep='last')
This mostly works, but I need 'College' to take precedence over 'Work' when the two are put together on the same date. In the data above, this row: | 2022 | John |Work | would be the one kept from dropping duplicates and not the one with 'College'.
Everything works, except this little part where the dates are duplicated AND there are two differing plans.
In a non pandas setting I would think this:
if service dates are duplicated AND one == college AND other == anything else: then keep the one with college
The end result I need:
Date | Name | Plan |
---|---|---|
2022 | John | College |
2021 | Kel | College |
2022 | James | Work |
2019 | Daron | College |
2019 | JQ | NaN |
2021 | Mel | Work |
2018 | Shama | Work |
Let me know if that makes sense, Thank you!