Capping multiple dataframe columns based on values in other columns

Question

This is my first post in StackOverflow so hoping i don't break any posting conventions.

I have a dataframe that contains Acct_ID, Current_Balance, Credit_Limit and monthly EADs. Example Table

import pandas as pd
df = pd.DataFrame({'Acct_ID': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 
                   'Current_Balance': [8000, 7000, 6000, 8000, 7000, 6000, 3000, 2000, 5000, 5000],
                   'Credit_Limit': [10000, 9000, 8000, 7000, 6000, 5000, 4000, 3000, 2000, 1000],
                   'EAD_1': [8500, 7500, 6500, 8500, 7500, 6500, 5500, 4500, 3500, 500],
                   'EAD_2': [9500, 8500, 7500, 6500, 5500, 4500, 3500, 2500, 1500, 1500],
                   'EAD_3': [10500, 9500, 8500, 7500, 6500, 5500, 4500, 3500, 2500, 2000],
                   'EAD_4': [12000, 11000, 10000, 9000, 8000, 7000, 6000, 5000, 4000, 3000]})

I need to apply a cap to the EAD columns that uses the following logic:

If current balance exceed credit limit, then monthly EAD for each observation is the smaller of EAD and current balance. Alternatively, if current balance does not exceed credit limit, then monthly EAD for each observation is the smaller of EAD and credit limit. A floor of 0 is also being applied to the monthly EADs.

My background is in SAS and I have only recently started working with Python so struggling to find an efficient solution. I am including below the SAS equivalent for the outlined logic purely for reference:

if Current_Balance > Credit_limit then do;
    EAD_&j. = Max(Min(EAD_&j.,Current_Balance),0);
end;

else do;
    EAD_&j. = Max(Min(EAD_&j.,Credit_limit),0); 
end;

I already have a working solution in Python but that takes ages to run. The sample i am currently working with has 325k observations and I need to apply the "capping" to 120 columns with monthly EADs. At the moment the below code in Python takes 40 mins to run. I dare not run this on the whole population of 12 million records...Just as a comparison, it takes me 2-3 minutes to do the same in SAS using the script noted above.

def EAD_LT_adjustment(curr_bal, credit_limit, EAD_t):
    if curr_bal > credit_limit:
        return max(min(EAD_t, curr_bal), 0)
    else:
        return max(min(EAD_t, credit_limit), 0)

for k in range(13, lifetime + 1):
    EL_Acc2['EAD_LT_T' + str(k)] = EL_Acc2.apply(lambda x: EAD_LT_adjustment(x['curr_bal'], x['credit_limit'], x['EAD_LT_T' + str(k)]), axis=1)

Any ideas how to optimise the code? The objective is to get to the same result but in significantly less time.

Thank you.

Welcome to Stackoverflow. There are a few conventions that you should try to follow! Most relevant to Pandas questions is to not post pictures of dataframes but (small examples of) dataframes themselves, per this: [how to provide a great pandas example](http://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples). Also you may find this to be very useful as well: [mcve] — piterbarg, Nov 25 '20 at 22:49
Thank you. I have edited my original post, removing the picture link and replacing it with an example df. — Simorgh, Nov 25 '20 at 23:15

score 0 · Accepted Answer · answered Nov 25 '20 at 23:25

DataFrames are well-suited for column-wise operations which is exactly what is needed for your rules. for example, this is how we apply your rule to EAD_1 to come up with a new column EAD_1_c

df.loc[df['Current_Balance'] > df['Credit_Limit'],'EAD_1_c'] = df[['EAD_1','Current_Balance']].min(axis=1).clip(0)
df.loc[df['Current_Balance'] <= df['Credit_Limit'],'EAD_1_c'] = df[['EAD_1','Credit_Limit']].min(axis=1).clip(0)

output:

      Acct_ID    Current_Balance    Credit_Limit    EAD_1    EAD_2    EAD_3    EAD_4    EAD_1_c
--  ---------  -----------------  --------------  -------  -------  -------  -------  ---------
 0          1               8000           10000     8500     9500    10500    12000       8500
 1          2               7000            9000     7500     8500     9500    11000       7500
 2          3               6000            8000     6500     7500     8500    10000       6500
 3          4               8000            7000     8500     6500     7500     9000       8000
 4          5               7000            6000     7500     5500     6500     8000       7000
 5          6               6000            5000     6500     4500     5500     7000       6000
 6          7               3000            4000     5500     3500     4500     6000       4000
 7          8               2000            3000     4500     2500     3500     5000       3000
 8          9               5000            2000     3500     1500     2500     4000       3500
 9         10               5000            1000      500     1500     2000     3000        500

You can loop over all EAD_n like in your own solution. I would expect this to be a lot faster than your row by row looping

See here and here for further details

Capping multiple dataframe columns based on values in other columns

1 Answers1