I have 2 dataframes one dataframe(df1) contains columns like- ISIN, Name, Currency, Value, % Weight, Asset type., comments and assumptions
So this dataframe looks like this:- df1
ISIN Name Currency Value % Weight Asset Type Comments/ Assumptions
0 NaN Transcanada Trust 5.875 08/15/76 USD 7616765.00 0.0176 NaN https://assets.cohenandsteers.com/assets/conte...
1 NaN Bp Capital Markets Plc Flt Perp USD 7348570.50 0.0169 NaN Holding value for each constituent is derived ...
2 NaN Transcanada Trust Flt 09/15/79 USD 7341250.00 0.0169 NaN NaN
3 NaN Bp Capital Markets Plc Flt Perp USD 6734022.32 0.0155 NaN NaN
4 NaN Prudential Financial 5.375% 5/15/45 USD 6508290.68 0.0150 NaN NaN
(241, 7)
whereas I have another dataframe df2 having columns like- Short Name, ISIN.
This dataframe looks like this.
Short Name ISIN
0 ABU DHABI COMMER AEA000201011
1 ABU DHABI NATION AEA002401015
2 ABU DHABI NATION AEA006101017
3 ADNOC DRILLING C AEA007301012
4 ALPHA DHABI HOLD AEA007601015
(66987, 2)
I developed a logic that compares Name(from df1) and Short Name(from df2) based on a match it extracts relevant ISIN(from df2) into df1(ISIN column which is empty at present). Here's the logic for the same
def strMergeData(strColumnDf1):
strColumnDf1 = strColumnDf1.split()[0]
for strColumnDf2 in df2['Short Name']:
if strColumnDf1 in strColumnDf2:
return df2[df2['Short Name'] == strColumnDf2]['ISIN'].values[0]
break
else:
pass
df1['ISIN'] = df1.apply(lambda x: strMergeData(x['Name']),axis=1)
print(df1)
which gives the output as :
ISIN Name Currency Value % Weight Asset Type Comments/ Assumptions
0 NA Transcanada Trust 5.875 08/15/76 USD 7616765.00 0.0176 NaN https://assets.cohenandsteers.com/assets/conte...
1 NA Bp Capital Markets Plc Flt Perp USD 7348570.50 0.0169 NaN Holding value for each constituent is derived ...
2 NA Transcanada Trust Flt 09/15/79 USD 7341250.00 0.0169 NaN NaN
3 NA Bp Capital Markets Plc Flt Perp USD 6734022.32 0.0155 NaN NaN
4 NA Prudential Financial 5.375% 5/15/45 USD 6508290.68 0.0150 NaN NaN
The end result should look like this however because of the logic(which actually compares Name and Short Name word by word) it takes the first occurrence in the dataframe and straightaway gives ISIN which is incorrect. For eg: for Name- Bank of Scotland ISIN is 1324fdd is written as 1345o
as a result, I developed a new logic using fuzzywuzzy
module which shows the exact match, if a match is not relevant wrt Name then it shows null. Here's the logic.
mat1 = []
mat2 = []
p = []
# converting dataframe column
# to list of elements
# to do fuzzy matching
list1 = df1['Name'].tolist()
list2 = df2['Short Name'].tolist()
# taking the threshold as 80
threshold = 93
# iterating through list1 to extract
# it's closest match from list2
for i in list1:
mat1.append(process.extractOne(i, list2, scorer=fuzz.token_set_ratio))
df1['matches'] = mat1
# iterating through the closest matches
# to filter out the maximum closest match
for j in df1['matches']:
if j[1] >= threshold:
p.append(j[0])
mat2.append(",".join(p))
p = []
# storing the resultant matches back
# to df1
df1['matches'] = mat2
print("\nDataFrame after Fuzzy matching using token_set_ratio():")
print(df1.tail())
and the output that I get is this:
ISIN Name Currency Value % Weight Asset Type Comments/ Assumptions matches
236 NaN Partnerre Ltd 4.875% Perp Sr:J USD 1.684069e+05 0.0004 NaN NaN
237 NaN Berkley (Wr) Corporation 5.700% 03/30/58 USD 6.955837e+04 0.0002 NaN NaN
238 NaN Tc Energy Corp Flt Perp Sr:11 USD 6.380262e+04 0.0001 NaN NaN TC ENERGY CORP
239 NaN Cash and Equivalents USD 2.166579e+07 0.0499 NaN NaN
240 NaN AUM NaN 4.338766e+08 0.9999 NaN NaN AUM IND BARC US
This output basically adds a match column on df1 and constitutes which ShortName(from df1) matches Name(from df1) however doesn't add any ISIN.
How do I add ISIN from df2 to df1 based on the above logic(fuzzywuzzy
) so that in the new dataframe(df3) I get the output as:
ISIN Name Currency Value % Weight Asset Type Comments/ Assumptions
0 NA Transcanada Trust 5.875 08/15/76 USD 7616765.00 0.0176 NaN https://assets.cohenandsteers.com/assets/conte...
1 NA Bp Capital Markets Plc Flt Perp USD 7348570.50 0.0169 NaN Holding value for each constituent is derived ...
2 NA Transcanada Trust Flt 09/15/79 USD 7341250.00 0.0169 NaN NaN
3 NA Bp Capital Markets Plc Flt Perp USD 6734022.32 0.0155 NaN NaN
4 NA Prudential Financial 5.375% 5/15/45 USD 6508290.68 0.0150 NaN NaN
Please help.