I am using python's recordlinkage
toolkit to string match school name columns from two dataframes
, df1
and df2
, while blocking on their common column 'division'
.
My code is as below:
import recordlinkage
from recordlinkage.standardise import clean
indexer = recordlinkage.Index()
indexer.block('division')
candidate_links_2 = indexer.index(df1, df2)
compare = recordlinkage.Compare()
compare.string('school_name', 'school_name', method='jaro', threshold=0.95)
compare_vectors_2 = compare.compute(candidate_links_2, df1, df2)
matches_2 = compare_vectors_2[compare_vectors_2[0]==1.0]
matches_2.reset_index(inplace=True)
matches_2 = matches_2.rename(columns={'level_0': 'df1_index', 'level_1': 'df2_index', 0: 'match'})
matches_duplicates_dropped_2 = matches_2.drop_duplicates(subset = ['df1_index'], keep = 'last')
matches_duplicates_dropped_2
With this code, I am getting some matches that are not supposed to match -
df1_index df2_index match df1_school df2_school
2 15136 26636 1.0 girls middle 34sp girls middle 53sp
4 15137 26636 1.0 girls middle 34sp girls middle 53sp
7 15148 26636 1.0 girls middle 35sp girls middle 53sp
10 15149 26636 1.0 girls middle 35sp girls middle 53sp
... ... ... ... ... ...
43794 64087 6601 1.0 islamabad model i v2 i 91ii islamabad model i v2 i 101ii
I want to create a separate rule where if there are numbers in school_name, e.g. 'girls middle 35sp'
has number 35
etc., the numbers have to be 'exact match', so 'girls middle 35sp'
should only be matched with 'girls middle 35sp'
, and shouldn't get matched with 'girls middle 53sp'
etc.
I wonder if that's even a possibility, or whether there is a better toolkit out there for this particular purpose.