
I have two dataframes, df1 and df2, with ~40,000 and ~70,000 rows respectively, containing data about polling stations in country A.

The two dataframes have some common columns like 'polling_station_name', 'province', 'district', etc. However, df1 has latitude and longitude columns, whereas df2 does not, so I am doing string matching between the two dataframes so that at least some rows of df2 will have geolocations available. I am blocking on the 'district' column while doing the string matching.

This is the code that I have so far:

import recordlinkage
from recordlinkage.standardise import clean

# Only compare records that share the same 'district'
indexer = recordlinkage.Index()
indexer.block('district')
candidate_links = indexer.index(df1, df2)

# Fuzzy-match polling station names within each block
compare = recordlinkage.Compare()
compare.string('polling_station_name', 'polling_station_name',
               method='damerau_levenshtein', threshold=0.75)
compare_vectors = compare.compute(candidate_links, df1, df2)

This produced about 12,000 matches. However, I have noticed that some polling stations are being matched incorrectly because their names are very similar even though they are in different locations - e.g. 'government girls primary school meilabu' and 'government girls primary school muzaka' are clearly different stations, yet they are being matched.

I think utilising NLP might help here: certain words occur very frequently in the data, like 'government', 'girls', 'boys', 'primary', 'school', etc., so I could put less emphasis on those words and more emphasis on 'meilabu', 'muzaka', etc. while doing the string matching, but I am not sure where to start. (For reference, many of the polling stations are government, i.e. public, schools.)

Any advice would be greatly appreciated!

dmswjd

2 Answers


The topic is very broad, but the standard approaches to pay attention to are:

  • TF-IDF: term frequency–inverse document frequency is often used as a weighting factor; it down-weights words that appear in many documents.
  • Measure the similarity between two names using cosine similarity (a minimal sketch follows below).
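
For example, here is a minimal sketch of both ideas using scikit-learn, scoring only the blocked candidate pairs from the question. It assumes df1 and df2 have a default integer index, so the labels in candidate_links can be used directly as row positions:

import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Fit one shared vocabulary so both sets of vectors live in the same
# space; IDF automatically down-weights words like 'government',
# 'girls', 'primary' and 'school' that appear in thousands of names.
names1 = df1['polling_station_name'].fillna('')
names2 = df2['polling_station_name'].fillna('')
vectorizer = TfidfVectorizer().fit(pd.concat([names1, names2]))
tfidf1 = vectorizer.transform(names1)
tfidf2 = vectorizer.transform(names2)

# TfidfVectorizer L2-normalises each row by default, so the row-wise
# dot product of a candidate pair equals its cosine similarity.
i = candidate_links.get_level_values(0).to_numpy()
j = candidate_links.get_level_values(1).to_numpy()
scores = np.asarray(tfidf1[i].multiply(tfidf2[j]).sum(axis=1)).ravel()

Under this weighting, a pair like 'government girls primary school meilabu' vs 'government girls primary school muzaka' should score noticeably lower than under plain edit distance, because the only informative tokens ('meilabu', 'muzaka') do not match.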
ipj

@ipj is right that the topic is very broad. You can try the methods below.

from sklearn.metrics.pairwise import cosine_similarity

def get_sim_measure(sentence1, sentence2):
    vec1 = get_vector(sentence1)
    vec2 = get_vector(sentence2)
    # cosine_similarity expects 2-D arrays, so wrap each vector
    return cosine_similarity([vec1], [vec2])[0, 0]

Now the get_vector function can be implemented in many ways:

  • Remove the stop words first, then use word2vec or GloVe at the word level and average the word vectors over the sentence. (simple; see the sketch after this list)
  • Use doc2vec from Gensim for a vector embedding of the whole sentence. (medium)
  • Use BERT (DistilBERT or something lighter) for dynamic embeddings with context. (hard)
  • Combine TF-IDF weighting with GloVe embeddings. (simple)
  • Use spaCy's entity recognition and then do similarity matching on the entity labels (in this case the words in 'government girls primary school' will act as stop words). (slow but simple)
  • Use BLEU score for measuring similar words, in case you need it. (possibly misleading)
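
As a concrete starting point, here is a sketch of the first (simple) option using spaCy's built-in word vectors. It assumes the en_core_web_md model is installed; note that local place names like 'meilabu' may be out of vocabulary, in which case they contribute nothing to the average.

import numpy as np
import spacy

# Requires: python -m spacy download en_core_web_md
nlp = spacy.load('en_core_web_md')

def get_vector(sentence):
    # Drop stop words and out-of-vocabulary tokens, then average the
    # remaining word vectors into one fixed-length sentence vector.
    tokens = [t for t in nlp(sentence) if not t.is_stop and t.has_vector]
    if not tokens:
        return np.zeros(nlp.vocab.vectors_length)
    return np.mean([t.vector for t in tokens], axis=0)

You can then score a pair with get_sim_measure('government girls primary school meilabu', 'government girls primary school muzaka') as defined above.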

There are many possible situations, so give a few of the simple options a try first and go from there.

Shahidur