How to compare 2 Dataframes on set columns using Groupby where the data is not organized ideally in one of them

Question

The problem that I am attempting to solve is to reconcile rates applied to the actual accounts shown in "AccountTable" versus what the should be set in "RateTable". For each Account its rate can be set at different levels, either at the Account level or at the Parent level. More than one account can be linked to the same parent but it would differ by currency for example. While I can get it to compare, my solution involves a lot of code which is repeated and its not scalable where this example only looks at 2 different grouping where I could have up to 9 different grouping combo's to compare.

Here is sample AccountTable:

import pandas as pd
import numpy as np
AccountTable = pd.DataFrame([[1234567890,456,'EUR',3.5],
                    [7854567890,15,'USD',2.7],
                    [9632587415,56,'GBP',1.4]],
columns = ['Account','ParentID','Cur','Rate'])
AccountTable

Output:

Account ParentID    Cur Rate
0   1234567890  456 EUR 3.5
1   7854567890  15  USD 2.7
2   9632587415  56  GBP 1.4

Here is the RateTable:

RateTable = pd.DataFrame([['Account',1234567890,'EUR',3.5], # Rate set at account level and shuold return a match
                    ['ParentID',456,'EUR',3.5], # should be Unused as match found at account level
                    ['ParentID',15,'USD',2.7],# rate set at account level and matches 
                    ['ParentID',15,'CAD',1.5],# CAD not in Account Table therfore unused 
                    ['Account',9876542190,'EUR',3.5], # Account Table therfore unused  
                    ['ParentID',56,'GBP',1.5]], # rate set on parent level but rates don't match so return should be mismatch here
columns = ['Level_Type','ID','Cur','Set_Rate'])

Output:

Level_Type  ID          Cur Set_Rate
0   Account 1234567890  EUR 3.5
1   ParentID 456        EUR 3.5
2   ParentID 15         USD 2.7
3   ParentID 15         CAD 1.5
4   Account 9876542190  EUR 3.5
5   ParentID 56         GBP 1.5

My solutions is as follows, where I split the RateTable into multiply Dataframes based on the different level. In this case 2 - Account level and Parent Level. Then I left join them independently to the AccountTable using the Groupby function and compare the rates.

option1 = ['Account']
option2 = ['ParentID']
AccountView = RateTable[RateTable['Level_Type'].isin(option1)]
ParentView = RateTable[RateTable['Level_Type'].isin(option2)]
AccountView = AccountView.rename(columns={'Set_Rate':'Account_Set_Rate'})
ParentView = ParentView.rename(columns={'Set_Rate':'Parent_Set_Rate'})
AccountView = AccountView.rename(columns={'ID':'Account_ID'})
ParentView = ParentView.rename(columns={'ID':'Parent_ID'})
# new view to identify matches at Account level Only 
df = pd.merge(AccountTable, AccountView, left_on=['Account','Cur'], right_on=['Account_ID','Cur'], how='left')
df['Account_level_RateMatch'] = np.where(df['Rate'] == df['Account_Set_Rate'],'1','0').astype(int)

Account ParentID    Cur Rate    Level_Type  Account_ID  Account_Set_Rate    Account_level_RateMatch
0   1234567890  456 EUR 3.5     Account     1.234568e+09    3.5             1
1   7854567890  15  USD 2.7     NaN         NaN             NaN             0
2   9632587415  56  GBP 1.4     NaN         NaN             NaN             0

The above is repeated but matching on the parent level now:

df = pd.merge(AccountTable, ParentView, left_on=['ParentID','Cur'], right_on=['Parent_ID','Cur'], how='left')
df['Parent_level_RateMatch'] = np.where(df['Rate'] == df['Parent_Set_Rate'],'1','0').astype(int) # compare rates

Output:

Account ParentID    Cur Rate    Level_Type  Parent_ID   Parent_Set_Rate Parent_level_RateMatch
0   1234567890  456 EUR 3.5     ParentID    456         3.5             1
1   7854567890  15  USD 2.7     ParentID    15          2.7             1
2   9632587415  56  GBP 1.4     ParentID    56          1.5             0

I need to have a better way to compare the Account Rates to the Rate table rather than doing separate views. Also, the logic needs to be if a match is found at the first level "Account level" its stops there and doesn't need to check the next level i.e. parent level such as in # row 1 it matches at both account and parent level.

Any thoughts or solutions would be greatly appreciated.

Desired Output:

Account ParentID    Cur Rate    IsMatch LevelFound
0   1234567890  456 EUR 3.5     1       Account
1   7854567890  15  USD 2.7     1       Parent
2   9632587415  56  GBP 1.4     0       Parent

@Ben.T I've added to the original post what the desired outcome should look like — Alan Paul, Nov 12 '21 at 14:54
ok let's say for the second account `7854567890`, there is a line in rateTable with the same Account but wrong rate. Then there is a match with Parent, do you want to keep the information that it does not match with the account or the matching one with the parent? — Ben.T, Nov 12 '21 at 15:03
For account ```7854567890 ```, there no matching account in the RateTable. In this case then, in the AccountTable each account will have a ParentID, for this account ```7854567890 ``` it is 15. Therefore we look for 15 in the RateTable and compare its rate which is 2.7 back to the AccountTable. Note, in my code I group by currency too to find its unique pairing. Does that make sense? — Alan Paul, Nov 12 '21 at 15:22

Ben.T · Accepted Answer · 2021-11-12T16:54:40.093

EDIT solution, similar to original one but to fit better OP's expected output

#define the order of the levels
ordered_levels = ['Account','ParentID'] 

# fnd all the matching rates
res = (
    pd.concat(
        [AccountTable
           .merge(RateTable.loc[RateTable['Level_Type'].eq(lvl),#row with good level
                                 # columns for comparision with AccountTable
                                ['ID','Cur','Set_Rate']]
                           .rename(columns={'ID':lvl, 'Cur':f'Cur_opt'}), 
                 on=lvl, how='inner')
           .query('Cur == Cur_opt') #EDIT to query same cur
           .assign(LevelFound=lvl, 
                   #EDIT if rate not the same then 0
                   Is_Match=lambda x: x['Rate'].eq(x['Set_Rate']).astype(int)) 
         for lvl in ordered_levels]) # do the merge operation on each level
    #EDIT for selecting first 1 if any, then first 0
    .sort_values('Is_Match', ascending=False) 
    # keep the first matched per initial AccountTable or higher level non-match
    .drop_duplicates(ordered_levels) 
    [AccountTable.columns.tolist() + ['LevelFound','Is_Match']] 
)
print(res)
#       Account  ParentID  Cur  Rate LevelFound  Is_Match
# 0  1234567890       456  EUR   3.5    Account         1
# 1  7854567890        15  USD   2.7   ParentID         1
# 3  9632587415        56  GBP   1.4   ParentID         0

Original solution

Here is a solution, you need first to define the order of the levels, then you can loop over each one, select the rows wanted in RateTable, then merge with Account, and keep only the matched cur and rate (query). concat all the matched data and keep only the first match per AccountTable initial row.

#define the order of the levels
ordered_levels = ['Account','ParentID'] 

# fnd all the matching rates
matched = (
    pd.concat(
        [AccountTable
           .merge(RateTable.loc[RateTable['Level_Type'].eq(lvl),#row with good level
                                # columns for comparision with AccountTable
                                ['ID','Cur','Set_Rate']] 
                           .rename(columns={'ID':lvl, 'Cur':f'Cur_opt'}), 
                 on=lvl, how='inner')
           # keep only the matching data
           .query('Cur == Cur_opt and Rate == Set_Rate')
           # add the two columns for the ouput
           .assign(LevelFound=opt, Is_Match=1)
         for lvl in ordered_levels]) # do the merge operation on each level
    .drop_duplicates(ordered_levels) # keep the first matched per initial AccountTable
    [AccountTable.columns.tolist() + ['LevelFound','Is_Match']] 
)
print(matched) # note that the row wihtout match is missing
#       Account  ParentID  Cur  Rate LevelFound  Is_Match
# 0  1234567890       456  EUR   3.5    Account         1
# 1  7854567890        15  USD   2.7   ParentID         1

If you want to add the rows with no match, then you can do

res = AccountTable.merge(matched, how='left')
print(res)
#       Account  ParentID  Cur  Rate LevelFound  Is_Match
# 0  1234567890       456  EUR   3.5    Account       1.0
# 1  7854567890        15  USD   2.7   ParentID       1.0
# 2  9632587415        56  GBP   1.4        NaN       NaN

Great Solution...it really is. One thing, the expected result for account 9632587415 should be a mismatch or '0' under Is_match and not NaN. While the account isn't found at the account level, it is available at the parent level under ParentID 56. — Alan Paul, Nov 12 '21 at 16:18
Great work Ben. It really solves the problem outlined. My next question is, how difficult is it to apply if/else conditions to this code? The above was a simplified version of what I really need to do. For example, if a rate field was blank it would apply a rate from another table. Should I submit a new question since this one was really addressed? — Alan Paul, Nov 12 '21 at 21:12
While an inner join was done. Is there a way to include Set_Rate from the RateTable on the final output? — Alan Paul, Nov 12 '21 at 23:19
@AlanPaul for Set_rate, just add it in the last line of code that select the output columns so `[AccountTable.columns.tolist() + ['LevelFound','Is_Match','Set_Rate']]` the inner merge is on the row, so this column is still available. for the if else condition, yes a new question would be better, refer to this one could help too :) — Ben.T, Nov 13 '21 at 00:43
Yes perfect I can add the additional columns for further analysis. Is line `.query('Cur == Cur_opt') #EDIT to query same cur` used as the grouping between both tables on how it matches? Is there a way to have different matching criteria per level? i.e. for Account lvl it should be Currency and Account, for Parent lvl is should be Currency, ParentID and maybe 3rd variable. Basically a way to further loop through the rows on different matching criteria to find matches/mismatches. — Alan Paul, Nov 14 '21 at 10:11

How to compare 2 Dataframes on set columns using Groupby where the data is not organized ideally in one of them

1 Answers1