
I have two dataframes, df1 and df2, with ~40,000 and ~70,000 rows respectively, containing data about polling stations in country A.

The two dataframes have some common columns like 'polling_station_name', 'province', 'district', etc. However, df1 has latitude and longitude columns, whereas df2 does not, so I am doing string matching between the two dataframes so that at least some rows of df2 will have geolocations available. I am blocking on the 'district' column while doing the string matching.

This is the code that I have so far:

import recordlinkage
from recordlinkage.standardise import clean

# Only compare records that share the same 'district'
indexer = recordlinkage.Index()
indexer.block('district')
candidate_links = indexer.index(df1, df2)

# Fuzzy-match polling station names within each block
compare = recordlinkage.Compare()
compare.string('polling_station_name', 'polling_station_name',
               method='damerau_levenshtein', threshold=0.75)
compare_vectors = compare.compute(candidate_links, df1, df2)

This produced about 12,000 matches. However, I have noticed that some polling stations are being matched incorrectly because their names are very similar even though they are in different locations - e.g. 'government girls primary school meilabu' and 'government girls primary school muzaka' are clearly different stations, yet they are being matched.

I think utilising NLP might help here: certain words occur very frequently in the data, like 'government', 'girls', 'boys', 'primary', 'school', etc., so I could put less emphasis on those words and more emphasis on 'meilabu', 'muzaka', etc. while doing the string matching, but I am not sure where to start. (For reference, many of the polling stations are government, i.e. public, schools.)

Any advice would be greatly appreciated!

dmswjd

2 Answers


The topic is very broad, but the standard approaches to pay attention to are:

  • TF-IDF: term frequency–inverse document frequency is often used as a weighting factor; it down-weights words that appear in many documents.
  • Measure the similarity between two names using cosine similarity (a minimal sketch follows below).
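
For example, here is a minimal sketch of both ideas using scikit-learn, scoring only the blocked candidate pairs from the question. It assumes df1 and df2 have a default integer index, so the labels in candidate_links can be used directly as row positions:

import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Fit one shared vocabulary so both sets of vectors live in the same
# space; IDF automatically down-weights words like 'government',
# 'girls', 'primary' and 'school' that appear in thousands of names.
names1 = df1['polling_station_name'].fillna('')
names2 = df2['polling_station_name'].fillna('')
vectorizer = TfidfVectorizer().fit(pd.concat([names1, names2]))
tfidf1 = vectorizer.transform(names1)
tfidf2 = vectorizer.transform(names2)

# TfidfVectorizer L2-normalises each row by default, so the row-wise
# dot product of a candidate pair equals its cosine similarity.
i = candidate_links.get_level_values(0).to_numpy()
j = candidate_links.get_level_values(1).to_numpy()
scores = np.asarray(tfidf1[i].multiply(tfidf2[j]).sum(axis=1)).ravel()

Under this weighting, a pair like 'government girls primary school meilabu' vs 'government girls primary school muzaka' should score noticeably lower than under plain edit distance, because the only informative tokens ('meilabu', 'muzaka') do not match.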
ipj

@ipj is right that the topic is very broad. You can try the methods below.

from sklearn.metrics.pairwise import cosine_similarity

def get_sim_measure(sentence1, sentence2):
    vec1 = get_vector(sentence1)
    vec2 = get_vector(sentence2)
    # cosine_similarity expects 2-D arrays, so wrap each vector
    return cosine_similarity([vec1], [vec2])[0, 0]

Now the get_vector function can be implemented in many ways:

  • Remove the stop words first, then use word2vec or GloVe at the word level and average the word vectors over the sentence. (simple; see the sketch after this list)
  • Use doc2vec from Gensim for a vector embedding of the whole sentence. (medium)
  • Use BERT (DistilBERT or something lighter) for dynamic embeddings with context. (hard)
  • Combine TF-IDF weighting with GloVe embeddings. (simple)
  • Use spaCy's entity recognition and then do similarity matching on the entity labels (in this case the words in 'government girls primary school' will act as stop words). (slow but simple)
  • Use BLEU score for measuring similar words, in case you need it. (possibly misleading)
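
As a concrete starting point, here is a sketch of the first (simple) option using spaCy's built-in word vectors. It assumes the en_core_web_md model is installed; note that local place names like 'meilabu' may be out of vocabulary, in which case they contribute nothing to the average.

import numpy as np
import spacy

# Requires: python -m spacy download en_core_web_md
nlp = spacy.load('en_core_web_md')

def get_vector(sentence):
    # Drop stop words and out-of-vocabulary tokens, then average the
    # remaining word vectors into one fixed-length sentence vector.
    tokens = [t for t in nlp(sentence) if not t.is_stop and t.has_vector]
    if not tokens:
        return np.zeros(nlp.vocab.vectors_length)
    return np.mean([t.vector for t in tokens], axis=0)

You can then score a pair with get_sim_measure('government girls primary school meilabu', 'government girls primary school muzaka') as defined above.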

There are many possible situations, so give a few of the simple options a try first and go from there.

Shahidur