I was experimenting with fuzzywuzzy and found that it produced wrong results in quite a few cases. While debugging, I ran into a case with get_matching_blocks() that I can't explain.
My understanding is that get_matching_blocks() should return triples (i, j, n) such that the substring of length n starting at index i in the first string exactly matches the substring of length n starting at index j in the second string.
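To make that expectation concrete, here is a small check of the property I am assuming. The two short strings are the example pair from the difflib documentation, not my real data, and the variable names are only for illustration. As far as I understand, the last triple is a zero-length sentinel that get_matching_blocks() always appends.

```python
import difflib

# Short illustrative inputs (not the strings from my actual problem).
a = "abxcd"
b = "abcd"

sm = difflib.SequenceMatcher(None, a, b)
blocks = sm.get_matching_blocks()

for i, j, n in blocks:
    # Each triple should describe identical slices of the two inputs.
    assert a[i:i + n] == b[j:j + n]
    print(i, j, n, repr(a[i:i + n]))
```

On short inputs like these the property holds and the blocks cover everything the two strings have in common, which is what I expected for my longer input as well.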
>>> hay = """"Find longest matching block in a[alo:ahi] and b[blo:bhi]. If isjunk was omitted or None, find_longest_match() returns (i, j, k) such that a[i:i+k] is equal to b[j:j+k], where alo <= i <= i+k <= ahi and blo <= j <= j+k <= bhi. For all (i', j', k') meeting those conditions, the additional conditions k >= k', i <= i', and if i == i', j <= j' are also met. In other words, of all maximal matching blocks, return one that starts earliest in a, and of all those maximal matching blocks that start earliest in a, return the one that starts earliest in b."""
>>> needle = "meeting those conditions"
>>> needle in hay
True
>>> import difflib
>>> sm = difflib.SequenceMatcher(None, needle, hay)
>>> sm.get_matching_blocks()
[Match(a=5, b=8, size=2), Match(a=24, b=550, size=0)]
>>>
So why does the above code fail to find the matching block for needle, even though needle is a substring of hay?