0

I am using python's recordlinkage toolkit to string match school name columns from two dataframes, df1 and df2, while blocking on their common column 'division'.

My code is as below:

import recordlinkage
from recordlinkage.standardise import clean

indexer = recordlinkage.Index()
indexer.block('division')
candidate_links_2 = indexer.index(df1, df2)
compare = recordlinkage.Compare()
compare.string('school_name', 'school_name', method='jaro', threshold=0.95)
compare_vectors_2 = compare.compute(candidate_links_2, df1, df2)

matches_2 = compare_vectors_2[compare_vectors_2[0]==1.0]
matches_2.reset_index(inplace=True) 
matches_2 = matches_2.rename(columns={'level_0': 'df1_index', 'level_1': 'df2_index', 0: 'match'})

matches_duplicates_dropped_2 = matches_2.drop_duplicates(subset = ['df1_index'], keep = 'last') 
matches_duplicates_dropped_2

With this code, I am getting some matches that are not supposed to match -

    df1_index   df2_index   match   df1_school  df2_school
2   15136   26636   1.0 girls middle 34sp   girls middle 53sp
4   15137   26636   1.0 girls middle 34sp   girls middle 53sp
7   15148   26636   1.0 girls middle 35sp   girls middle 53sp
10  15149   26636   1.0 girls middle 35sp   girls middle 53sp
... ... ... ... ... ...
43794   64087   6601    1.0 islamabad model i v2 i 91ii islamabad model i v2 i 101ii

I want to create a separate rule where if there are numbers in school_name, e.g. 'girls middle 35sp' has number 35 etc., the numbers have to be 'exact match', so 'girls middle 35sp' should only be matched with 'girls middle 35sp', and shouldn't get matched with 'girls middle 53sp' etc.

I wonder if that's even a possibility, or whether there is a better toolkit out there for this particular purpose.

ejcho
  • 19
  • 4

1 Answers1

1

You should try threshold = 1.

From the documentation:

threshold (float, tuple of floats) – A threshold value. All approximate string comparisons higher or equal than this threshold are 1. Otherwise 0.

sneeu
  • 2,602
  • 3
  • 24
  • 27
DjBalcony
  • 93
  • 6
  • Then other school names that are the same won't get matched - e.g. 'girls primary school islamabad' and 'girls primary school islamabaad' (they are the same schools, with a very tiny spelling mistake in one) – ejcho Aug 24 '20 at 13:50