
in:

from difflib import SequenceMatcher

print('---------------------ksv in long string')
temp='gksvlkdfvjmflkvmoiflksjvmoiflkvmoilfjvmoierlkvjfdsljfiefjvo\
isfvoiafvjfojwfdkvasldkcosxzfjirkjmcoipfvjopsnosjvjrgegrjsdijfowijfoiwjfoiwjfoiwjfoijlksvlkdfvjmfl\
kvmoiflksjvmoiflkvmoilfjvmoierlkvjfdsljfiefjvofegegewtfvasvervvwfjoiw'

print(SequenceMatcher(None, 'ksv',temp).get_matching_blocks())

print('-----------------------long string start with ksv')
temp='ksvlkdfvjmflkvmoiflksjvmoiflkvmoilfjvmoierlkvjfdsljfiefjvo\
isfvoiafvjfojwfdkvasldkcosxzfjirkjmcoipfvjopsnosjvjrgegrjsdijfowijfoiwjfoiwjfoiwjfoijlksvlkdfvjmfl\
kvmoiflksjvmoiflkvmoilfjvmoierlkvjfdsljfiefjvofegegewtfvasvervvwfjoiw'

print(SequenceMatcher(None, 'ksv',temp).get_matching_blocks())

print('-----------------------ksv in short string')
temp='gksvlkdfvjmflkvmoiflksjvmoiflkvmoilfjvmoierlkvjfdsljfiefjvo'

print(SequenceMatcher(None, 'ksv',temp).get_matching_blocks())

out:

---------------------ksv in long string
[Match(a=3, b=226, size=0)]
-----------------------long string start with ksv
[Match(a=0, b=0, size=3), Match(a=3, b=225, size=0)]
-----------------------ksv in short string
[Match(a=0, b=1, size=3), Match(a=3, b=59, size=0)]

Obviously, in the first case 'ksv' is in temp, but get_matching_blocks() didn't return a block for it.

Then I deleted the leading 'g' from temp, and it returned the right block.

I also tried making temp shorter while still not starting with 'ksv', and it returned the right block too.

So I'm confused: why didn't the first attempt succeed?

Kevin liu

1 Answer

As Tim Peters said, passing autojunk=False to SequenceMatcher() makes it return the right blocks.
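A minimal sketch of the difference. The string b here is made up for illustration (it is not the temp from the question): it is longer than 200 characters, and 'k', 's' and 'v' repeat heavily in it.

```python
from difflib import SequenceMatcher

# Hypothetical input: one 'x' followed by 'ksv' repeated 100 times (301 chars).
a = 'ksv'
b = 'x' + 'ksv' * 100

# Default (autojunk=True): the heavily repeated characters of b are treated as junk.
with_autojunk = SequenceMatcher(None, a, b).get_matching_blocks()

# Heuristic disabled: every character of b is eligible for matching.
without_autojunk = SequenceMatcher(None, a, b, autojunk=False).get_matching_blocks()

print(with_autojunk)
print(without_autojunk)
```

With the defaults, only the terminating zero-length block should come back. With autojunk=False, the block starting at b=1 should show up as well.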

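Here is a brief explanation of autojunk. An item is automatically treated as junk when both of these hold:

1. The item's duplicates (after the first one) account for more than 1% of the sequence.
2. The sequence is at least 200 items long.

Items marked as junk this way are not used for sequence matching. In the first example temp is 226 characters long and 'k', 's' and 'v' each repeat often enough to be marked, so no block is reported. The 59-character string is below the 200-item threshold, so the heuristic does not apply to it.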

From the Python documentation:

Automatic junk heuristic: SequenceMatcher supports a heuristic that automatically treats certain sequence items as junk. The heuristic counts how many times each individual item appears in the sequence. If an item’s duplicates (after the first one) account for more than 1% of the sequence and the sequence is at least 200 items long, this item is marked as “popular” and is treated as junk for the purpose of sequence matching. This heuristic can be turned off by setting the autojunk argument to False when creating the SequenceMatcher.
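To see which items the heuristic would pick, here is a rough sketch. The helper popular_items is my own name, not part of difflib, and the threshold n // 100 + 1 is an assumption based on my reading of CPython's implementation. The input b is again a made-up string. The last line compares the result against the bpopular attribute that SequenceMatcher exposes.

```python
from collections import Counter
from difflib import SequenceMatcher


def popular_items(seq):
    """Items the heuristic quoted above would mark as popular (a sketch)."""
    n = len(seq)
    if n < 200:
        # Sequences shorter than 200 items are never affected.
        return set()
    # Assumed threshold, mirroring CPython's difflib: more than n // 100 + 1 occurrences.
    threshold = n // 100 + 1
    return {item for item, count in Counter(seq).items() if count > threshold}


# Hypothetical input: one 'x' followed by 'ksv' repeated 100 times (301 chars).
b = 'x' + 'ksv' * 100

print(sorted(popular_items(b)))
print(sorted(SequenceMatcher(None, '', b).bpopular))
```

If the assumption about the threshold is right, both lines should print the same characters.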

Kevin liu