Is there a way to filter what the python-levenshtein extension changes?

Question

I've got a large list of names (strings) that I have to check against each other to see if there are any typos.

To do this I've been using the python-Levenshtein extension from PyPI, comparing the names in the list against each other and treating any pair with a Levenshtein distance of 1 as a typo.

I am running into a problem with names such as 'cat 1' and 'cat 2', which are clearly different cats (not a typo) but are being flagged because their Levenshtein distance is 1.

I've tried adding a check beforehand to see whether the string contains any numbers, but as the list is quite long this doesn't do much for efficiency.

Ideally, I'm looking for a way to specify that if the only character that changes is a digit (e.g. 'cat 1' vs 'cat 2'), the pair is not considered a typo.
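To make the rule concrete, here is a rough sketch of the behaviour I'm after. It is pure Python for illustration only: the helper names (`within_one_edit`, `differs_only_in_digits`, `is_probable_typo`, `find_typo_pairs`) are made up for this example and are not part of python-Levenshtein. It also assumes that a pair whose numbers differ in length (e.g. 'cat 1' vs 'cat 12') should be skipped as well.

```python
import re
from itertools import combinations

_DIGITS = re.compile(r"\d+")


def within_one_edit(a, b):
    """Return True if the Levenshtein distance between a and b is exactly 1."""
    if a == b or abs(len(a) - len(b)) > 1:
        return False
    if len(a) > len(b):
        a, b = b, a  # make a the shorter (or equal-length) string
    # Advance to the first position where the strings disagree.
    i = 0
    while i < len(a) and a[i] == b[i]:
        i += 1
    if len(a) == len(b):
        # One substitution: everything after the mismatch must match.
        return a[i + 1:] == b[i + 1:]
    # One insertion/deletion: skip the extra character in the longer string.
    return a[i:] == b[i + 1:]


def differs_only_in_digits(a, b):
    """Return True if a and b are different but identical once digit runs are masked."""
    # "\0" is used as a placeholder on the assumption it never occurs in a name.
    return a != b and _DIGITS.sub("\0", a) == _DIGITS.sub("\0", b)


def is_probable_typo(a, b):
    """Distance-1 pair that is not explained by a changed number."""
    return within_one_edit(a, b) and not differs_only_in_digits(a, b)


def find_typo_pairs(names):
    """Naive pairwise scan; still quadratic in the number of names."""
    return [(a, b) for a, b in combinations(names, 2) if is_probable_typo(a, b)]


print(find_typo_pairs(["cat 1", "cat 2", "cot 1"]))
```

With the extension installed, `Levenshtein.distance(a, b) == 1` could presumably stand in for `within_one_edit`. The pairwise loop is still the expensive part, so this only shows the filter I want, not a fix for the efficiency concern.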

Suggestions for a different extension or method are welcome. My greatest concern is efficiency since, as mentioned, I have a big list.

sample data, current effort, current output and expected output — Haleemur Ali, Jun 13 '18 at 01:21
Are the words English language words? Could you just spell check it? Also, couldn't you do your current process and then take that result and check for integer changes rather than checking for numbers as you go? What size data are we talking about here? What's your current code that is producing the distances? — Zev, Jun 13 '18 at 02:11


0 Answers