
I have a dataset that I created by merging two DataFrames on the "NAME" column, so now I have one larger dataset. To finish the DF, I want to apply some logic to it to clean it up.

Requirements: For each unique 'NAME' I want to keep the row with the highest Sales value. If all Sales values for that name are less than 10, move to the Calls column and keep the row with the highest Calls value; if all values in 'Calls' are also less than 10, move to the Target column and keep the row with the highest Target. No rows are summed.

Here's my DF:

        NAME  CUSTOMER_SUPPLIER_NUMBER  Sales  Calls  Target
0   OFFICE 1                   2222277     84    170     265
1   OFFICE 1                   2222278     26    103     287
2   OFFICE 1                   2222278     97    167     288
3   OFFICE 2                   2222289      7    167     288
4   OFFICE 2                   2222289      3    130     295
5   OFFICE 2                   2222289      9    195     257
6   OFFICE 3                   1111111      1      2     286
7   OFFICE 3                   1111111      5      2     287
8   OFFICE 3                   1111112      9      7     230
9   OFFICE 4                   1111171     95    193     299
10  OFFICE 5                   1111191      9    193     298

Here's what I want to show in the final DF:

        NAME  CUSTOMER_SUPPLIER_NUMBER  Sales  Calls  Target
0   OFFICE 1                   2222277     97    167     288
5   OFFICE 2                   2222289      9    195     257
7   OFFICE 3                   1111111      5      2     287
9   OFFICE 4                   1111171     95    193     299
10  OFFICE 5                   1111191      9    193     298

I was thinking of solving this by using df.iterrows()

Here's what I've tried:

for n, v in df.iterrows(): 
    if int(v['Sales']) > 10:
        calls = df.loc[(v['NAME'] == v) & (int(v['Calls'].max()))]
        if int(calls['Calls']) > 10:
            target =  df.loc[(v['NAME'] == v) & (int(v['Target'].max()))]
        else:
            print("No match found")
    else:
        sales = df.loc[(v['NAME'] == v) & (int(v['Sales'].max()))]

However, I keep getting KeyError: False error messages. Any thoughts on what I'm doing wrong?

aero8991
  • By _"if 'CALLS' are less than 10"_ do you mean: if the 'CALLS' value that was selected is less than 10, or the sum of CALLS for that CUSTOMER_SUPPLIER_NUMBER, or what...? –  Dec 08 '21 at 00:53
  • Good question, thanks for clarifying that. I'm looking for the row with the highest Calls value, not the sum of the values. – aero8991 Dec 08 '21 at 00:58
    By "if sales are less than 10," do you mean the "if highest sales are less than 10"? –  Dec 08 '21 at 01:00
  • Yes, exactly! If the highest is less than 10, move on to the next column and check Calls, and follow that same logic of checking whether it's less than 10. – aero8991 Dec 08 '21 at 01:04

2 Answers


This is not optimized, but it should meet your needs. The snippet sends each NAME group to eval_group(), which looks at the row with the highest value in each column, in order, until the Sales, Calls, Target criterion is met.

If you were to optimize, you could apply vectorization or parallelism to eval_group so it is called against all groups at once instead of sequentially (see the rough sketch after the output below).

A couple of notes: this will return the first row when there is a tie (i.e. multiple records share the maximum during the idxmax() call). Also, I believe that in your question the first row of the desired answer should be OFFICE 1's row 2, not row 0.
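
For illustration, idxmax() returns only the first matching index label when the maximum is tied (toy values, unrelated to your data):

import pandas as pd

s = pd.Series([5, 9, 9], index=[10, 11, 12])
s.idxmax()   # returns 11, the first of the two tied maxima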

import pandas as pd

df = pd.read_csv('./data.txt')

def eval_group(df, keys):
    # Walk the columns in priority order and return the index of the row with
    # the highest value in the first column whose maximum is >= 10, falling
    # back to the last column if none qualifies.
    for key in keys:
        row_id = df[key].idxmax()
        if df.loc[row_id, key] >= 10 or key == keys[-1]:
            return row_id

row_ids = []
keys = ['Sales', 'Calls', 'Target']
for name in df['NAME'].unique().tolist():
    condition = df['NAME'] == name
    row_ids.append(eval_group(df[condition], keys))

df = df[df.index.isin(row_ids)]

df
        NAME  CUSTOMER_SUPPLIER_NUMBER  Sales  Calls  Target
2   OFFICE 1                   2222278     97    167     288
5   OFFICE 2                   2222289      9    195     257
7   OFFICE 3                   1111111      5      2     287
9   OFFICE 4                   1111171     95    193     299
10  OFFICE 5                   1111191      9    193     298
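
Here is a rough sketch of the vectorized variant mentioned above. It reuses eval_group and keys from the snippet and assumes df is still the original merged DataFrame (run it in place of the manual loop and final filter), so treat it as illustrative rather than tested:

# Hand each NAME group to eval_group via groupby().apply() instead of looping
row_ids = df.groupby('NAME', sort=False).apply(lambda g: eval_group(g, keys))

# row_ids is a Series of one row label per NAME
result = df.loc[row_ids.values]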
Brent
  • Hey, thanks a lot for this response, I appreciate it! I wasn't able to get the same results; I got this as a result. Also, please note I changed a few of the numbers around, but I expect the same results. I am getting Office 1 appearing 3 times and Office 2 appearing twice. I copied your formula and logic exactly. – aero8991 Dec 08 '21 at 03:33
  • Based on the unique() logic, you shouldn't be getting any duplicate names unless you have some whitespace around the names. This would make something like 'Office 1' and 'Office 1 ' look the same when printed, but different when computed. – Brent Dec 08 '21 at 05:49
  • Ah yeah, I had a few spaces in there. That's what did it! Thank you, this is great! – aero8991 Dec 08 '21 at 17:14

This takes a couple of steps: build intermediate DataFrames, set up the conditions, and filter based on the result of those conditions:

import numpy as np

# index (per NAME) of the row with the max value in each numeric column
temp = (df
        .drop(columns='CUSTOMER_SUPPLIER_NUMBER')
        .groupby('NAME', sort=False)
        .idxmax()
        )

# get the booleans for rows less than 10
bools = df.loc(axis=1)['Sales':'Target'].lt(10)

# collapse per NAME: True only if *all* rows in that column are less than 10
bools = bools.groupby(df.NAME, sort=False).all()

# conditions buildup: use Sales unless all Sales < 10, then Calls, then Target
condlist = [~bools.Sales, ~bools.Calls, ~bools.Target]
choicelist = [temp.Sales, temp.Calls, temp.Target]
# you might have to figure out what to use for default
indices = np.select(condlist, choicelist, default=temp.Sales)

# get matching rows
df.loc[indices]
        NAME  CUSTOMER_SUPPLIER_NUMBER  Sales  Calls  Target
2   OFFICE 1                   2222278     97    167     288
5   OFFICE 2                   2222289      9    195     257
7   OFFICE 3                   1111111      5      2     287
9   OFFICE 4                   1111171     95    193     299
10  OFFICE 5                   1111191      9    193     298
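
In case np.select is unfamiliar: for each position it picks the first choice whose condition is True, and falls back to default otherwise. A toy example with made-up values:

import numpy as np

conds = [np.array([True, False]), np.array([False, True])]
choices = [np.array([1, 2]), np.array([3, 4])]
np.select(conds, choices, default=-1)   # array([1, 4])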
sammywemmy