1

I have 2 dataframes one dataframe(df1) contains columns like- ISIN, Name, Currency, Value, % Weight, Asset type., comments and assumptions

So this dataframe looks like this:- df1

   ISIN                                 Name Currency       Value  % Weight  Asset Type                              Comments/ Assumptions
0   NaN     Transcanada Trust 5.875 08/15/76      USD  7616765.00    0.0176         NaN  https://assets.cohenandsteers.com/assets/conte...
1   NaN      Bp Capital Markets Plc Flt Perp      USD  7348570.50    0.0169         NaN  Holding value for each constituent is derived ...
2   NaN       Transcanada Trust Flt 09/15/79      USD  7341250.00    0.0169         NaN                                                NaN
3   NaN      Bp Capital Markets Plc Flt Perp      USD  6734022.32    0.0155         NaN                                                NaN
4   NaN  Prudential Financial 5.375% 5/15/45      USD  6508290.68    0.0150         NaN                                                NaN
(241, 7)

whereas I have another dataframe df2 having columns like- Short Name, ISIN.

This dataframe looks like this.

  Short Name          ISIN
0  ABU DHABI COMMER  AEA000201011
1  ABU DHABI NATION  AEA002401015
2  ABU DHABI NATION  AEA006101017
3  ADNOC DRILLING C  AEA007301012
4  ALPHA DHABI HOLD  AEA007601015
(66987, 2)

I developed a logic that compares Name(from df1) and Short Name(from df2) based on a match it extracts relevant ISIN(from df2) into df1(ISIN column which is empty at present). Here's the logic for the same

def strMergeData(strColumnDf1):
    strColumnDf1 = strColumnDf1.split()[0]
    for strColumnDf2 in df2['Short Name']:
        if strColumnDf1 in strColumnDf2:
            return df2[df2['Short Name'] == strColumnDf2]['ISIN'].values[0]
            break
        else:
            pass
        
df1['ISIN'] = df1.apply(lambda x: strMergeData(x['Name']),axis=1)
print(df1)

which gives the output as :

ISIN                                 Name Currency       Value  % Weight  Asset Type                              Comments/ Assumptions
0   NA     Transcanada Trust 5.875 08/15/76      USD  7616765.00    0.0176         NaN  https://assets.cohenandsteers.com/assets/conte...
1   NA      Bp Capital Markets Plc Flt Perp      USD  7348570.50    0.0169         NaN  Holding value for each constituent is derived ...
2   NA       Transcanada Trust Flt 09/15/79      USD  7341250.00    0.0169         NaN                                                NaN
3   NA      Bp Capital Markets Plc Flt Perp      USD  6734022.32    0.0155         NaN                                                NaN
4   NA  Prudential Financial 5.375% 5/15/45      USD  6508290.68    0.0150         NaN                                                NaN

The end result should look like this however because of the logic(which actually compares Name and Short Name word by word) it takes the first occurrence in the dataframe and straightaway gives ISIN which is incorrect. For eg: for Name- Bank of Scotland ISIN is 1324fdd is written as 1345o

as a result, I developed a new logic using fuzzywuzzy module which shows the exact match, if a match is not relevant wrt Name then it shows null. Here's the logic.

mat1 = []
mat2 = []
p = []

# converting dataframe column
# to list of elements
# to do fuzzy matching
list1 = df1['Name'].tolist()
list2 = df2['Short Name'].tolist()

# taking the threshold as 80
threshold = 93

# iterating through list1 to extract
# it's closest match from list2
for i in list1:
    mat1.append(process.extractOne(i, list2, scorer=fuzz.token_set_ratio))
df1['matches'] = mat1

# iterating through the closest matches
# to filter out the maximum closest match
for j in df1['matches']:
    if j[1] >= threshold:
        p.append(j[0])
    mat2.append(",".join(p))
    p = []

# storing the resultant matches back
# to df1
df1['matches'] = mat2
print("\nDataFrame after Fuzzy matching using token_set_ratio():")
print(df1.tail())

and the output that I get is this:

ISIN                                      Name Currency         Value  % Weight  Asset Type Comments/ Assumptions          matches
236   NaN            Partnerre Ltd 4.875% Perp Sr:J      USD  1.684069e+05    0.0004         NaN                   NaN
237   NaN  Berkley (Wr) Corporation 5.700% 03/30/58      USD  6.955837e+04    0.0002         NaN                   NaN
238   NaN             Tc Energy Corp Flt Perp Sr:11      USD  6.380262e+04    0.0001         NaN                   NaN   TC ENERGY CORP
239   NaN                      Cash and Equivalents      USD  2.166579e+07    0.0499         NaN                   NaN
240   NaN                                       AUM      NaN  4.338766e+08    0.9999         NaN                   NaN  AUM IND BARC US

This output basically adds a match column on df1 and constitutes which ShortName(from df1) matches Name(from df1) however doesn't add any ISIN.

How do I add ISIN from df2 to df1 based on the above logic(fuzzywuzzy) so that in the new dataframe(df3) I get the output as:

ISIN                                 Name Currency       Value  % Weight  Asset Type                              Comments/ Assumptions
0   NA     Transcanada Trust 5.875 08/15/76      USD  7616765.00    0.0176         NaN  https://assets.cohenandsteers.com/assets/conte...
1   NA      Bp Capital Markets Plc Flt Perp      USD  7348570.50    0.0169         NaN  Holding value for each constituent is derived ...
2   NA       Transcanada Trust Flt 09/15/79      USD  7341250.00    0.0169         NaN                                                NaN
3   NA      Bp Capital Markets Plc Flt Perp      USD  6734022.32    0.0155         NaN                                                NaN
4   NA  Prudential Financial 5.375% 5/15/45      USD  6508290.68    0.0150         NaN                                                NaN

Please help.

technophile_3
  • 531
  • 6
  • 21

1 Answers1

0

One option is to use recordlinkage: https://recordlinkage.readthedocs.io/en/latest/

The code below is a quick hack, so will probably need fixing:


import recordlinkage

# Indexation step
indexer = recordlinkage.Index()
indexer.add(recordlinkage.index.Full())
candidate_links = indexer.index(df1, df2)

# Comparison step
compare_cl = recordlinkage.Compare()

compare_cl.string('Name', 'Short Name', label='name_similarity', method='jarowinkler', threshold=0.85)

matches = compare_cl.compute(candidate_links, df1, df2)
SultanOrazbayev
  • 14,900
  • 3
  • 16
  • 46