1

I'm new to pandas. So I have dataframe that looks like that:

    id  car date    color
1   2   bmw 2021-05-21  black
2   3   bmw 2021-05-21  yellow
3   4   mercedes    2021-06-21  red
4   5   toyota  2021-11-01  pink
5   6   toyota  2021-09-06  black
6   7   mercedes    2021-07-07  white

I need to choose rows that have unique "car" with the latest date and color in priority (red, pink, black, white, yellow), by that i mean if car in dataframe sorted by date has color red, then i save that line, if car doesn't have red color then i search for pink color & etc.

so firstly - sort by date secondly - if date is duplicated, i sort it by needed color

i did sorting by date: df.sort_values(by="date").drop_duplicates(subset=["car", "color"], keep="last")

and it looks like that rn:

    id  car date    color
1   2   bmw 2021-05-21  black
2   3   bmw 2021-05-21  yellow
3   4   mercedes    2021-06-21  red
6   7   mercedes    2021-07-07  white
5   6   toyota  2021-09-06  black
4   5   toyota  2021-11-01  pink

what i actually wanna see:

id  car date    color
2   bmw 2021-05-21  black
5   toyota  2021-11-01  pink
7   mercedes    2021-07-07  white
Sara
  • 41
  • 4
  • 1
    @user3483203 well I see, I change that to car + color, cuz i need to save colors in next step where i sort it by priority colors, maybe the logic isn't right, but i need to delete rows where the date isn't latest – Sara Aug 02 '21 at 19:20

1 Answers1

1

IIUC, use pd.Categorical to define your order, then sort_values + groupby.

df["color"] = pd.Categorical(
    df["color"], categories=["red", "pink", "black", "white", "yellow"], ordered=True
)

df.sort_values(by=["date", "color"], ascending=[0, 1]).groupby(
    "car", as_index=0
).first()

        car  id       date  color
0       bmw   2 2021-05-21  black
1  mercedes   7 2021-07-07  white
2    toyota   5 2021-11-01   pink
user3483203
  • 50,081
  • 9
  • 65
  • 94