The dataset I have consists of manually entered addresses. The data is big and has a LOT of variations.
The address column holds the full address: apartment number, street name, neighborhood name, and city. Since it was filled in manually, there are a lot of typos.
The city I want to find is 'İstanbul'. It contains a Turkish character, and I'm running into some encoding issues as well. For example, calling lower() on the İ in İstanbul doesn't give me a character that a plain 'i' in a regex pattern will match.
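To make the lower() problem concrete, here is a minimal sketch using only the standard library. In Python 3, lowercasing 'İ' yields an 'i' followed by a combining dot above (U+0307), which is why a plain 'i' pattern misses it. The helper name `fold` is hypothetical, and folding every i-variant to ASCII 'i' is an assumption that suits matching, not correct Turkish casing.

```python
import unicodedata

# 'İ'.lower() produces two code points: 'i' + U+0307 (combining dot above).
lowered = "İ".lower()
print([hex(ord(ch)) for ch in lowered])

def fold(text: str) -> str:
    """Hypothetical helper: lowercase and strip diacritics for matching."""
    # Dotless 'ı' has no Unicode decomposition, so map it explicitly.
    text = text.lower().replace("ı", "i")
    # NFKD splits accented letters into base letter + combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop the combining marks, including the U+0307 left by 'İ'.lower().
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(fold("İSTANBUL"))
print(fold("Istanbul"))
```

After folding, 'İSTANBUL' and 'Istanbul' compare equal, so an ordinary regex or string comparison can be applied to the folded text.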
For this reason, among others, I changed my approach to fuzzy string searching. I want to give two reference strings to my fuzzy lookup algorithm: '/ist' and 'İstanbul'. These are the reference values to look up in my dataframe's address column.
Examples of rows with phrases I want to catch:
...İSYTANBUL...
...isanbul...
...Istanbul...
...İ/STANBUL...
...,STANBUL/ÜSKÜDAR...
isatanbul
iatanbul
İSTRANBUL
isytanbul
/isanbul
I've tried my luck with fuzzywuzzy, but a simple fuzz.ratio('istanbul', 'İSTANBUL') returns a ratio of 0 between the two words. How can I make fuzzywuzzy or another library pick these patterns up?
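One possible direction, sketched under stated assumptions: normalize case and diacritics first, then compare each word of the address against the reference. The block below swaps in the standard library's difflib.SequenceMatcher in place of fuzzywuzzy so that it runs without extra packages. The names `fold` and `mentions_city` and the 0.8 threshold are hypothetical choices, and only the 'istanbul' reference is handled here, not '/ist'.

```python
import re
import unicodedata
from difflib import SequenceMatcher

def fold(text: str) -> str:
    """Hypothetical helper: lowercase and strip diacritics for matching."""
    text = text.lower().replace("ı", "i")  # dotless ı has no decomposition
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def mentions_city(address: str, reference: str = "istanbul",
                  threshold: float = 0.8) -> bool:
    """Return True if any word in the address is close to the reference.

    The threshold is an assumed starting point and needs tuning on real data.
    """
    # Split on anything that is not a letter, so '/' and ',' separate words.
    for token in re.findall(r"[a-z]+", fold(address)):
        if SequenceMatcher(None, token, reference).ratio() >= threshold:
            return True
    return False

samples = ["...İSYTANBUL...", "...İ/STANBUL...", "...,STANBUL/ÜSKÜDAR...",
           "iatanbul", "/isanbul", "Kızılay, Ankara"]
for s in samples:
    print(s, mentions_city(s))
```

On a dataframe this could be applied with something like df['address'].apply(mentions_city), where the column name 'address' is an assumption. The same folding step can also be run on both strings before calling fuzz.ratio, since that function compares characters case-sensitively.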