
I'm trying to find all occurrences of a word in a paragraph, and I want it to account for spelling mistakes as well. Code:

from difflib import SequenceMatcher

to_search = "caterpillar"
search_here = "caterpillar are awesome animal catterpillar who like other humans but not other caterpilar"
# search_here has the word caterpillar repeated, but with spelling mistakes

s = SequenceMatcher(None, to_search, search_here).get_matching_blocks()
print(s)

#Output  : [Match(a=0, b=0, size=11), Match(a=3, b=69, size=0)] 
#Expected: [Match(a=0, b=0, size=11), Match(a=0, b=32, size=11), Match(a=0, b=81, size=11)]

difflib's get_matching_blocks only detects the first instance of "caterpillar" in the search_here string. I want it to give me all closely matching blocks, i.e. it should identify "caterpillar", "catterpillar" and "caterpilar".

How can I solve this problem?

  • When you look for the difference between texts using diffing, you don't expect to find ALL possible differences between the two texts. It gives you one (1) estimate of how different the strings are and how much has to change in either input to produce the other. You are using the wrong tool for the job. – NewPythonUser May 31 '20 at 06:22

1 Answer


You could calculate the edit distance of each word vs. to_search. Then, you can select all the words that have a "low enough" edit distance (a distance of 0 means an exact match).
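For concreteness, here is a minimal sketch of that idea, assuming whitespace tokenization and a plain Levenshtein edit distance computed by hand; the cutoff of 2 edits is an arbitrary choice for this example:

def edit_distance_dp(a, b):
    """Classic dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion from a
                            curr[j - 1] + 1,      # insertion into a
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

to_search = "caterpillar"
search_here = "caterpillar are awesome animal catterpillar who like other humans but not other caterpilar"

close_words = [w for w in search_here.split() if edit_distance_dp(to_search, w) <= 2]
print(close_words)  # ['caterpillar', 'catterpillar', 'caterpilar']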

Thanks to your question, I have discovered that there is a pip-installable edit_distance Python module. Here are a couple of examples I just tried out for the first time:

>>> edit_distance.SequenceMatcher('fabulous', 'fibulous').ratio()
0.875
>>> edit_distance.SequenceMatcher('fabulous', 'wonderful').ratio()
0.11764705882352941
>>> edit_distance.SequenceMatcher('fabulous', 'fabulous').ratio()
1.0
>>> edit_distance.SequenceMatcher('fabulous', '').ratio()
0.0
>>> edit_distance.SequenceMatcher('caterpillar', 'caterpilar').ratio()
0.9523809523809523

So, it looks like the ratio method gives you a number between 0 and 1 (inclusive), where 1 is an exact match and 0 is... not even in the same league XD. So yeah, you could select words that have a ratio greater than 1 - epsilon, where epsilon is maybe 0.1 or thereabouts.
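To connect this back to the positions the question asks for, here is a sketch that scans word by word and reports each fuzzy match together with its start index. It uses the standard library's difflib.SequenceMatcher, whose ratio() is analogous to the one above; the epsilon of 0.1 is just the ballpark suggested here, so tune it for your data:

import re
from difflib import SequenceMatcher  # stdlib; the edit_distance module's ratio() is analogous

to_search = "caterpillar"
search_here = "caterpillar are awesome animal catterpillar who like other humans but not other caterpilar"

epsilon = 0.1  # tolerance; words scoring above 1 - epsilon count as matches
for m in re.finditer(r"\S+", search_here):
    word = m.group()
    score = SequenceMatcher(None, to_search, word).ratio()
    if score > 1 - epsilon:
        print(word, "starts at", m.start(), "ratio", round(score, 3))

On this input it reports the exact "caterpillar" plus both misspellings, each with the index where it begins in search_here.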

allyourcode
  • Thank you! But wouldn't it slow down if the text corpus is large... what could I do then? – Prajwal V Bharadwaj May 31 '20 at 08:51
  • If the average word size does not increase with corpus size, then the running time of my suggested system would just scale linearly with the size of your corpus, just like any other system you could think of. This might have a slightly larger constant scaling factor compared to other methods, but I expect the speed would be comparable to any other design. The only way to tell if this is fast enough for you is to run it. Such an experiment would not be difficult to perform. Anyway, if this answer helped, please consider upvoting. That is how to say thanks on Stack Overflow. – allyourcode Jun 01 '20 at 10:55
  • Also, this is very parallelizable (as would any other reasonable design). E.g. it could be implemented in mapreduce if you have a truly gargantuan data set. E.g. the entire WWW. – allyourcode Jun 01 '20 at 11:00
  • Mmhh okay I understand. This really helped – Prajwal V Bharadwaj Jun 02 '20 at 10:48