
I am trying to find the common substrings of two strings, but SequenceMatcher does not return all of the common substrings I expect.

s1 = '++%2F%2F+Prints+%22Hello%2C+World%22+to+the+terminal+window.%0A++++++++System.out.pr%29%3B%0A++++%7D%0A%7D%0ASample+program%0Apublic+static+voclass+id+main%28String%5B%5D+args%29+'
s2 = 'gs%29+%7B%0A++++++++%2F'
# The common substrings are '+%', '%0A++++++++', '%2' and 'gs%29+',
# but 'gs%29+' is not matched.

import difflib as d

seqmatch = d.SequenceMatcher(None,s1,s2)
matches = seqmatch.get_matching_blocks()

for match in matches:
    apos, bpos, matchlen = match
    print(s1[apos:apos+matchlen])

Output:

+%
%0A++++++++
%2

"gs%29+" is a common substring between s1 and s2, but it is not found by SequenceMatcher.

Am I missing something?

Thanks

rroutsong
  • Your question is not clear. Please revise your sentence. What is the expected output? – yoonghm Oct 05 '18 at 22:58
  • "gs%29+" is a common substring of s1 and s2, but it is not in the list of matches that SequenceMatcher produces – rroutsong Oct 05 '18 at 23:04
  • I believe it is a bug in `SequenceMatcher.get_matching_blocks()`: once it has found a match, it does not go back over the `s2` characters it has already passed. It found `s2[5:7]`, `s2[9:20]` and `s2[20:22]`, so it will not go back to find `s2[0:5]` – yoonghm Oct 06 '18 at 00:00
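
The behaviour described in the last comment can be checked with a small sketch. It relies on what the difflib documentation states: the longest contiguous matching block is found first, the search then continues only to the left and to the right of that block, and the triples returned by `get_matching_blocks()` are monotonically increasing in both sequences. The variable name `anchor` is only illustrative.

```python
import difflib

s1 = '++%2F%2F+Prints+%22Hello%2C+World%22+to+the+terminal+window.%0A++++++++System.out.pr%29%3B%0A++++%7D%0A%7D%0ASample+program%0Apublic+static+voclass+id+main%28String%5B%5D+args%29+'
s2 = 'gs%29+%7B%0A++++++++%2F'

sm = difflib.SequenceMatcher(None, s1, s2)

# The single longest common block over the full strings acts as the anchor.
anchor = sm.find_longest_match(0, len(s1), 0, len(s2))
print(repr(s1[anchor.a:anchor.a + anchor.size]))

# Position of 'gs%29+' relative to the anchor in each string.
needle = 'gs%29+'
print('after anchor in s1: ', s1.find(needle) > anchor.a)
print('before anchor in s2:', s2.find(needle) < anchor.b)

# The reported blocks never go backwards in either string.
blocks = sm.get_matching_blocks()
print(all(p.a <= q.a and p.b <= q.b for p, q in zip(blocks, blocks[1:])))
```

If both position checks print True, 'gs%29+' lies on opposite sides of the anchor in the two strings, so it cannot appear in an ordered list of blocks. That would make the result a consequence of the alignment-style design rather than a lost match.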

1 Answer


Perhaps the junk characters have confused the algorithm. I passed a lambda as the isjunk argument of SequenceMatcher() so that '+' is treated as junk:

s1 = '++%2F%2F+Prints+%22Hello%2C+World%22+to+the+terminal+window.%0A++++++++System.out.pr%29%3B%0A++++%7D%0A%7D%0ASample+program%0Apublic+static+voclass+id+main%28String%5B%5D+args%29+'
s2 = 'gs%29+%7B%0A++++++++%2F'
# The expected substring is 'gs%29+'

import difflib as d

seqmatch = d.SequenceMatcher(lambda x: x in "+", s1, s2)
matches = seqmatch.get_matching_blocks()

for match in matches:
    apos, bpos, matchlen = match
    print(s1[apos:apos+matchlen])

The output is now

gs%29+
yoonghm
  • Unfortunately this only works for that particular match, and not for the rest of the 15=< character slices of the mixed-up URL-encoded string I am trying to reassemble. I chose to write a custom brute-force approach that compares the beginning or end of the template string `s1` with the target string `s2`. I will take a look at `SequenceMatcher.get_matching_blocks()` and see whether a fix for this missing-substrings issue could be implemented. – rroutsong Oct 08 '18 at 20:51
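
The brute-force comparison mentioned in the last comment was not posted, so the following is only a guess at its general shape, not the commenter's code. The helper name `longest_overlap` is hypothetical; it returns the longest suffix of one string that is also a prefix of the other, which is one way to compare the end of `s1` with the beginning of `s2` (and the reverse).

```python
def longest_overlap(left, right):
    """Return the longest suffix of `left` that is also a prefix of `right`."""
    # Try the largest possible overlap first and shrink until one fits.
    for size in range(min(len(left), len(right)), 0, -1):
        if left.endswith(right[:size]):
            return right[:size]
    return ''


s1 = '++%2F%2F+Prints+%22Hello%2C+World%22+to+the+terminal+window.%0A++++++++System.out.pr%29%3B%0A++++%7D%0A%7D%0ASample+program%0Apublic+static+voclass+id+main%28String%5B%5D+args%29+'
s2 = 'gs%29+%7B%0A++++++++%2F'

# End of s1 against the beginning of s2.
print(longest_overlap(s1, s2))  # gs%29+

# End of s2 against the beginning of s1.
print(longest_overlap(s2, s1))
```

This only looks at the two ends of each string, so it does not replace a general common-substring search, but it does not depend on the order in which the fragments appear.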