
My dataset consists of manually entered addresses. It is large and has a LOT of variations.

The address column holds the full address, from apartment number and street name to neighborhood name and city. Since it's filled in manually, there are a lot of typos.

The city I want to look for is 'İstanbul'. It has a Turkish character and I'm running into some encoding issues as well. For example, calling lower() on the İ in İstanbul doesn't give me a character that a regular 'i' in a regex pattern will match.
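To make the lowercasing problem concrete, here is a small check using only the standard library. It shows what Python's `str.lower()` does with 'İ' (U+0130); the variable names are just for illustration.

```python
import re
import unicodedata

lowered = "İ".lower()

# str.lower() maps U+0130 to TWO code points: 'i' plus a combining dot.
print([unicodedata.name(ch) for ch in lowered])
# ['LATIN SMALL LETTER I', 'COMBINING DOT ABOVE']

lowered_city = "İSTANBUL".lower()
print(len(lowered_city))  # 9, not 8

# The combining dot sits between 'i' and 's', so a plain pattern misses it.
print(re.search("istanbul", lowered_city))  # None
```

So the result looks like 'i' on screen but is not equal to it, which is why the regex never matches.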

For this and other reasons, I switched to fuzzy string searching. I want to give two reference strings to my fuzzy lookup algorithm: '/ist' and 'İstanbul'. These are the values to look up in my dataframe's address column.

Example of rows with phrases I want to catch:

...İSYTANBUL...
...isanbul...
...Istanbul...
...İ/STANBUL...
...,STANBUL/ÜSKÜDAR...
isatanbul
iatanbul
İSTRANBUL
isytanbul
/isanbul

I've tried my luck with fuzzywuzzy, but a simple fuzz.ratio('istanbul', 'İSTANBUL') returns a ratio of 0. How can I make fuzzywuzzy or other libraries pick these patterns up?

  • `fuzz.ratio('istanbul', 'İSTANBUL'.lower())` gives 94. It is a good idea to lowercase your strings to make them more comparable. – Stefano Fiorucci - anakin87 Mar 25 '21 at 08:49
  • Bring them into canonical form (e.g. same casing, transliterate). – Simon Mar 25 '21 at 08:49
  • Probably also find out about Unicode normalization. NFC might expose exactly what fuzzywuzzy needs, or else try to remove the joining accents. There are many questions with recipes for how to do that. – tripleee Mar 25 '21 at 08:51
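The suggestions in the comments above (lowercase, canonical form, strip the joining accents) could be combined roughly as follows. This is only a sketch under a few assumptions: `canonical` and `looks_like_istanbul` are hypothetical helper names, the 0.8 threshold is an arbitrary starting point that would need tuning on real data, and the standard library's `difflib.SequenceMatcher` stands in for fuzzywuzzy so the example has no extra dependencies. Note that stripping combining marks also folds ü, ş, ç and ğ to plain ASCII letters.

```python
import re
import unicodedata
from difflib import SequenceMatcher


def canonical(text):
    """Lowercase, fold dotless ı to i, and drop combining accents."""
    lowered = text.lower().replace("ı", "i")
    decomposed = unicodedata.normalize("NFKD", lowered)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def looks_like_istanbul(address, reference="istanbul", threshold=0.8):
    """True if any alphabetic token of the address is close to the reference."""
    # Splitting on non-letters also separates 'İ/STANBUL' into 'i' and 'stanbul'.
    tokens = [tok for tok in re.split(r"[^a-z]+", canonical(address)) if tok]
    return any(
        SequenceMatcher(None, reference, tok).ratio() >= threshold
        for tok in tokens
    )


samples = [
    "...İSYTANBUL...", "...isanbul...", "...Istanbul...", "...İ/STANBUL...",
    "...,STANBUL/ÜSKÜDAR...", "isatanbul", "iatanbul", "İSTRANBUL",
    "isytanbul", "/isanbul",
]
matches = [looks_like_istanbul(s) for s in samples]
print(all(matches))
```

On a dataframe this could be applied with something like `df["address"].apply(looks_like_istanbul)`, where the column name `address` is assumed. `fuzz.ratio` from fuzzywuzzy could be swapped in for `SequenceMatcher`, keeping in mind that it reports scores from 0 to 100 rather than 0 to 1.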
