Python pandas - group by column A and prevent duplicated existence on column b?

Question

Suppose I have a dataframe

df = pd.DataFrame({"SKU": ["Coke", "Coke", "Coke", "Bread", "Bread", "Bread", "cake", "cake", "cake"], 
              "campaign":["buy1get1","$19", "event", "buy1get1","$19", "event", "buy1get1","$19", "event"],
                   "score": [0.9, 0.8, 0.4, 0.7, 0.6, 0.3, 0.5, 0.7, 0.5]})

    SKU    campaign score
0   Coke    buy1get1    0.9
1   Coke    $19         0.8
2   Coke    event       0.4
3   Bread   buy1get1    0.7
4   Bread   $19         0.6
5   Bread   event       0.3
6   cake    buy1get1    0.5
7   cake    $19         0.7
8   cake    event       0.5

I want to get the best product for each campaign which would be

df.sort_values("score").groupby("campaign", as_index=False).last()

and leads to the following output

    campanign   SKU score
0   $19         Coke    0.8
1   buy1get1    Coke    0.9
2   event       cake    0.5

But What I want is as follows as coke is already used in buy1get1 campaign and has higher score.

    campaign    SKU     score
0   $19         cake     0.7
1   event       Bread    0.3
2   buy1get1    Coke     0.9

logic:

go to the second largest value for campaign $19 because Coke is used already for campaign buy1get1 with higher score (0.9>0.8).
and then we got cake for campaign $19, that means we cannot use cake for campaign "event" anymore. and thus we go to third largest number for event: Bread

I have tried to think of some ways, but none of them are efficient/pythonic. I will need to deal with a large data set.

Instead of going into inefficient loop and/or other chaos, is there a better way for clarifying this kind of issue?

Your opinion and information will be much appreciated.

Is possible some categories is necessary skip not only first maximal, but also second, third... ? If yes, is posible change data sample for this data? — jezrael, Jan 28 '21 at 07:01
hi @jezrael thank you for the comment! is it is necessary to skip 1st, 2nd, 3rd.. as long as there is duplicates occurred. If to change the data sample, what is the better way to change it? — Leigh Tsai, Jan 28 '21 at 07:08
also, your answer is much appreciated. what I can think of now to do the above selection is to loop your answer several times until len(df["product"].unique() == number of campaign) — Leigh Tsai, Jan 28 '21 at 07:09
Not sure if understand, I think change data for e.g. skipping firast, second and third product. — jezrael, Jan 28 '21 at 07:09
I hate to say that, but I don't think it is possible for this case right now — Leigh Tsai, Jan 28 '21 at 07:11
Ya, I think change data for possible test this complicated data by editing question. — jezrael, Jan 28 '21 at 07:11
hi @jezrael I edited the question. Thank you. Don't think there is an efficient way as long as I write more about the logic though.. — Leigh Tsai, Jan 28 '21 at 07:22
thank you, I just realised my mask was wrong. I correct it and working with new data well. Is possible test if working well in general data? — jezrael, Jan 28 '21 at 07:34

jezrael · Accepted Answer · 2021-01-28T07:34:12.173

1

One idea is get product which matching last values and remove duplicates only for this product in original DataFrame and then run your solution again with new data:

df1 = df.sort_values("score")

last = df1.groupby("campaign")['SKU'].last()
mask = ~df1['SKU'].isin(last) | ~df1['SKU'].duplicated(keep='last')

df = df1[mask].groupby("campaign", as_index=False).last()
print (df)
   campaign    SKU  score
0       $19   cake    0.7
1  buy1get1   Coke    0.9
2     event  Bread    0.3

edited Jan 28 '21 at 07:34

answered Jan 28 '21 at 06:57

jezrael

822,522
95
1,334
1,252

1

Thank you for your assistance. Excellent! – Leigh Tsai Feb 02 '21 at 02:16

score 0 · Answer 2 · answered Jan 28 '21 at 07:53

You can try the below one where we are sorting data and then grouping data based on the SKU column and picking the first data from the group.

You can also change the sort order to pick the least one as well as the highest one

import pandas as pd
df = pd.DataFrame({"SKU": ["Coke", "Coke", "Coke", "Bread", "Bread", "Bread", "cake", "cake", "cake"], 
              "campaign":["buy1get1","$19", "event", "buy1get1","$19", "event", "buy1get1","$19", "event"],
                   "score": [0.9, 0.8, 0.4, 0.7, 0.6, 0.3, 0.5, 0.7, 0.5]})

df = df.sort_values("score", ascending=False).groupby(['SKU']).head(1).reset_index(drop=True)

print(df)
     SKU  campaign  score
0   Coke  buy1get1    0.9
1  Bread  buy1get1    0.7
2   cake       $19    0.7

Python pandas - group by column A and prevent duplicated existence on column b?

2 Answers2