0

I have a pandas dataframe as below.I want to create list of columns by iterating over list called 'fields_list' and separate out lists which ends with the list in 'fields_list'

import pandas as pd
import numpy as np
import sys
df = pd.DataFrame({'a_balance': [3,4,5,6], 'b_balance': [5,1,1,1]})
df['ah_balance'] = 0
df['a_agg_balance'] = 0
df['b_agg_balance'] = 0
df

a_balance   b_balance   ah_balance  a_agg_balance   b_agg_balance
3           5           0           0               0
4           1           0           0               0
5           1           0           0               0
6           1           0           0               0

fields_list =   [ ['<val>','_balance'],['<val_class>','_agg_balance']]
fields_list
[['<val>', '_balance'], ['<val_class>', '_agg_balance']]

for i,field in fields_list:
    df_final= [col for col in df if col.endswith(field)]
    print("df_final" ,df_final)

I tried above code but when it iterates over 1st element of fields_list(i.e. '', '_balance') it also includes elements that ends with '_agg_balance' and hence I get the below result

df_final ['a_balance', 'b_balance', 'ah_balance', 'a_agg_balance', 'b_agg_balance']
df_final ['a_agg_balance', 'b_agg_balance']

My expected output is

df_final ['a_balance', 'b_balance', 'ah_balance']
df_final ['a_agg_balance', 'b_agg_balance']
Shanoo
  • 1,185
  • 1
  • 11
  • 38
  • are you asking about these two specific suffixes (balance and egg_balance), or are you looking for a generic answer about any potential suffix? If it's the former, you have to define the question better - if there are some two suffixes that work for a field, which one is 'better'? – Roy2012 May 26 '20 at 19:36
  • Ya there might be more which will be defined under 'fields_list'. So basically we need to have a list created for all the suffix which is defined under 'fields_list – Shanoo May 26 '20 at 19:39

1 Answers1

0

You can sort the suffixes you're looking at, and start with the longest one. When you find a column that matches a suffix, remove it from the set of columns you need to look at:

fields_list =   [ ['<val>','_balance'],['<val_class>','_agg_balance']]

sorted_list = sorted(fields_list, key=lambda x: len(x[1]), reverse = True)
sorted_suffixes = [x[1] for x in sorted_list]

col_list = set(df.columns)
for suffix in sorted_suffixes:

    forecast_final_fields = [col for col in col_list if col.endswith(suffix)]
    col_list.difference_update(forecast_final_fields)
    print("forecast_final_fields" ,forecast_final_fields)

Results in

forecast_final_fields ['a_agg_balance', 'b_agg_balance']
forecast_final_fields ['ah_balance', 'a_balance', 'b_balance']
Roy2012
  • 11,755
  • 2
  • 22
  • 35