
I am working with a dataframe that has two columns of city names which should be equal, but often are not due to administrative errors, spelling mistakes, or name changes. I am trying to decide when two city names are 'equal enough' to be assumed equal. Using SequenceMatcher I can divide the list into roughly three parts: everything is wrong; a mix of wrong and right; everything is right.

In a perfect world I would want the list divided into just two parts: everything is wrong, everything is right, with the split made at a certain ratio/matching threshold.

Therefore, SequenceMatcher does not do the trick for me. I found textdistance, but I am overwhelmed by the number of options, and there are probably more libraries besides. An example of where it goes wrong:

'Zeddam' and 'Didam' are classified with a ratio of 0.72 (they are not equal), while 'Nes' and 'Nes gem dongeradeel' get a ratio of 0.45 (they are equal; the second just appends its municipality).
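For reference, a minimal sketch of the comparison with `difflib.SequenceMatcher` (assuming lowercased input; the exact ratios depend on casing, but the ranking inversion is the point):

```python
from difflib import SequenceMatcher

def ratio(a: str, b: str) -> float:
    """Similarity ratio between two city names, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# The unequal pair scores *higher* than the equal pair,
# so no single threshold can separate them cleanly.
print(ratio('Zeddam', 'Didam'))             # unequal, but scores high
print(ratio('Nes', 'Nes gem dongeradeel'))  # equal, but scores low
```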

Simply checking whether one string is a substring of the other does not do the trick either, since that causes problems in other cases.
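To illustrate the kind of problem a naive substring test runs into (a hypothetical counterexample; 'Ee' is a real Frisian village name used here only for illustration):

```python
# A very short name can be a substring of a completely unrelated longer name,
# so the naive containment test produces a false positive.
print('ee' in 'heerlen')  # True, yet 'Ee' and 'Heerlen' are different places
```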

Do you have a suggestion for an appropriate string-comparison algorithm, and why? I am comparing multiple columns in my dataframe, which has around 1,000 rows.

Hestaron
  • the Levenshtein package in Python has quite a few options that should give you what you want. Also, it might be worth checking if one is a subset, as well as some distance matching, and using a combined weight of them to suit your purposes – Chris Sep 17 '20 at 14:25
  • Maybe quite crude, but what about the length of the longest common substring, normalized by the length of the shorter of the two strings? 'Zeddam' and 'Didam' would have a score of 0.6 (still high, but these names are similar, there's not much to be done), and 'Nes' with 'Nes Gem Dongeradeel' would have a score of 1.0. Although this wouldn't work too well with spelling mistakes. I think the best here is to combine several metrics, instead of using just one. – Anakhand Sep 17 '20 at 14:28
  • I tried different comparisons from the textdistance and Levenshtein packages. I tried the subset check and already made some linear combinations. Levenshtein makes things worse. I also played around with Jaro-Winkler, but I still have to set the score boundary at 0.625 to get a cut-off at which everything above is 99.99% good, while below it only 80% is correct. I still have a hard time setting the boundary lower so that I can include more 'correct' combinations. Do you perhaps have more suggestions? – Hestaron Sep 22 '20 at 14:50
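The metric suggested in the comments (longest common substring, normalized by the length of the shorter string) can be sketched with stdlib difflib; this reproduces the 0.6 and 1.0 scores Anakhand mentions:

```python
from difflib import SequenceMatcher

def lcs_score(a: str, b: str) -> float:
    """Length of the longest common substring of a and b,
    normalized by the length of the shorter string."""
    a, b = a.lower(), b.lower()
    match = SequenceMatcher(None, a, b).find_longest_match(0, len(a), 0, len(b))
    return match.size / min(len(a), len(b))

print(lcs_score('Zeddam', 'Didam'))             # 'dam' -> 3/5 = 0.6
print(lcs_score('Nes', 'Nes gem dongeradeel'))  # 'nes' -> 3/3 = 1.0
```

As the comment notes, this handles the "name plus municipality" case well but is fragile against spelling mistakes, so it is probably best combined with an edit-distance-style metric rather than used alone.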

0 Answers