I have two dataframes
, df1
and df2
, with ~40,000 rows and ~70,000 rows respectively of data about polling stations in country A.
The two dataframes have some common columns like 'polling_station_name', 'province', 'district' etc., however df1
has latitude and longitude columns, whereas df2
doesn't, so I am trying to do string matching
between the two dataframes so at least some rows of df2
will have geolocations available. I am blocking on the 'district' column while doing the string matching.
This is the code that I have so far:
import recordlinkage
from recordlinkage.standardise import clean
indexer = recordlinkage.Index()
indexer.block('district')
candidate_links = indexer.index(df1, df2)
compare = recordlinkage.Compare()
compare.string('polling_station_name', 'polling_station_name', method='damerau_levenshtein', threshold=0.75)
compare_vectors = compare.compute(candidate_links, df1, df2)
This produced about 12,000 matches, however I have noticed that some polling station names are incorrectly being matched because their names are very similar when they are in different locations - e.g. 'government girls primary school meilabu'
and 'government girls primary school muzaka'
are clearly different, yet they are being matched.
I think utilising NLP might help here, to see if there are certain words that occur very frequently in the data, like 'government'
, 'girls'
, 'boys'
, 'primary'
, 'school'
, etc. so I can put less emphasis on those words, and put more emphasis on meilabu
, muzaka
etc. while doing the string matching, but I am not so sure where to start.
(For reference, many of the polling stations are 'government (i.e.public) schools')
Any advice would be greatly appreciated!