I'm using `difflib`'s `SequenceMatcher` to `get_opcodes()` and then highlight the changes with CSS to create some kind of web diff.
First, I set a `min_delta` so that I consider two strings different only if 3 or more characters in the whole string differ, counted one after another (`delta` is the actual delta encountered, which sums up all one-character changes):

    from difflib import SequenceMatcher

    # The first positional argument of SequenceMatcher is isjunk, so pass None explicitly.
    matcher = SequenceMatcher(None, source_str, diff_str)
    min_delta = 3
    delta = 0
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue  # nothing to capture here
        elif tag == "delete":
            if source_str[i1:i2].isspace():
                continue  # be whitespace-agnostic
            else:
                delta += (i2 - i1)  # delete i2-i1 chars
        elif tag == "replace":
            if source_str[i1:i2].isspace() or diff_str[j1:j2].isspace():
                continue  # be whitespace-agnostic
            else:
                delta += (i2 - i1)  # replace i2-i1 chars
        elif tag == "insert":
            if diff_str[j1:j2].isspace():
                continue  # be whitespace-agnostic
            else:
                delta += (j2 - j1)  # insert j2-j1 chars
    return_value = delta >= min_delta  # different if 3 or more characters differ

This helps me determine whether two strings really differ. It's not very efficient, but I couldn't think of anything better.
Then, I colorize the differences between two strings in the same way:

    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # juggling strings, wrapping them in <span>s and colorizing
            pass
        elif tag == "delete":
            # ...
            pass
    return_value = old_string, new_string

And the result looks pretty ugly (blue for replaced, green for new and red for deleted, nothing for equal):
This happens because `SequenceMatcher` matches every single character. I want it to match every single word instead (and probably the whitespace around them), or do something even more pleasing to the eye, because as you can see in the screenshot, the first book has actually moved to the fourth position.
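To illustrate what I mean by matching words, here is a minimal sketch. It assumes the strings are first split into tokens (runs of non-whitespace and runs of whitespace) and that those token lists are handed to `SequenceMatcher`, which accepts any sequences of hashable items. The helper names `tokenize` and `word_opcodes` are made up for illustration, and I'm not sure this is the best way to do it:

```python
import re
from difflib import SequenceMatcher


def tokenize(text):
    # Hypothetical helper: split into runs of non-whitespace and runs of
    # whitespace, so that "".join(tokens) reproduces the original string.
    return re.findall(r"\S+|\s+", text)


def word_opcodes(source_str, diff_str):
    # Compare lists of tokens instead of strings of characters.
    a, b = tokenize(source_str), tokenize(diff_str)
    # isjunk=None and autojunk=False: no element is treated as junk.
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    # Return each opcode tag together with the text it covers on both sides.
    return [
        (tag, "".join(a[i1:i2]), "".join(b[j1:j2]))
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
    ]


print(word_opcodes("the quick brown fox", "the quick red fox"))
```

With this, a changed word should show up as one `replace` covering the whole word rather than a scatter of single-character changes.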
It seems to me that something could be done with the `isjunk` and `autojunk` parameters of `SequenceMatcher`, but I can't figure out how to write `lambda`s for my purposes.
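For reference, this is how I understand `isjunk` from the documentation, as a minimal sketch: it is a one-argument function that receives a single element of the sequence and returns true if that element should be ignored. The names `is_blank` and `char_opcodes_ignoring_blanks` are made up for illustration. With plain strings the elements are still single characters, so this alone does not seem to give me word-level matching:

```python
from difflib import SequenceMatcher


def is_blank(ch):
    # Called with one sequence element at a time (here: one character).
    return ch in " \t"


def char_opcodes_ignoring_blanks(source_str, diff_str):
    # isjunk is the first positional parameter. autojunk=False turns off the
    # automatic junk heuristic, so only is_blank decides what counts as junk.
    matcher = SequenceMatcher(is_blank, source_str, diff_str, autojunk=False)
    return matcher.get_opcodes()


for opcode in char_opcodes_ignoring_blanks("one two", "one  three"):
    print(opcode)
```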
Thus, I have two questions:
1. Is it possible to match by words? Is it possible to do this using `get_opcodes()` and `SequenceMatcher`? If not, what could be used instead?
2. Okay, this is rather a corollary, but nevertheless: if matching by words is possible, then I can get rid of the dirty hacks with `min_delta` and return `True` as soon as at least one word differs, right?
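What I have in mind for the second question is roughly the following, a minimal sketch assuming whitespace-only changes should still be ignored (the name `words_differ` is made up for illustration):

```python
def words_differ(source_str, diff_str):
    # str.split() with no arguments splits on any run of whitespace and drops
    # leading/trailing whitespace, so the comparison stays whitespace-agnostic.
    return source_str.split() != diff_str.split()


print(words_differ("first  book\n", "first book"))
print(words_differ("first book", "fourth book"))
```

The first call compares strings that differ only in whitespace, the second compares strings where one word changed.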