match changes by words, not by characters

Question

I'm using difflib's SequenceMatcher to get_opcodes() and then highlight the changes with CSS to create a kind of web diff.

First, I set a min_delta so that two strings count as different only if 3 or more characters differ across the whole string (delta is the actual accumulated difference, summed over all character-level changes):

from difflib import SequenceMatcher

matcher = SequenceMatcher(None, source_str, diff_str)  # first argument is isjunk
min_delta = 3
delta = 0

for tag, i1, i2, j1, j2 in matcher.get_opcodes():
    if tag == "equal":
        continue  # nothing to capture here
    elif tag == "delete":
        if source_str[i1:i2].isspace():
            continue  # be whitespace-agnostic
        else:
            delta += (i2 - i1)  # delete i2-i1 chars
    elif tag == "replace":
        if source_str[i1:i2].isspace() or diff_str[j1:j2].isspace():
            continue  # be whitespace-agnostic
        else:
            delta += (i2 - i1)  # replace i2-i1 chars
    elif tag == "insert":
        if diff_str[j1:j2].isspace():
            continue  # be whitespace-agnostic
        else:
            delta += (j2 - j1)  # insert j2-j1 chars

return_value = True if (delta > min_delta) else False

This helps me determine whether two strings really differ. It's not very efficient, but I couldn't think of anything better.

Then, I colorize the differences between two strings in the same way:

for tag, i1, i2, j1, j2 in matcher.get_opcodes():
    if tag == "equal":
        # bustling with strings, inserting them in <span>s and colorizing
    elif tag == "delete":
        # ...

return_value = old_string, new_string

And the result looks pretty ugly (blue for replaced, green for new and red for deleted, nothing for equal):

This happens because SequenceMatcher matches every single character. I want it to match whole words instead (and probably the whitespace around them), or to do something even nicer to look at, because as you can see in the screenshot, the first book has actually moved to the fourth position.

It seems to me that something could be done with isjunk and autojunk parameters of SequenceMatcher, but I can't figure out how to write lambdas for my purposes.

Thus, I have two questions:

Is it possible to match by words? Can it be done with get_opcodes() and SequenceMatcher? If not, what could be used instead?
Okay, this is rather a corollary, but nevertheless: if matching by words is possible, then I can get rid of the dirty hacks with min_delta and return True as soon as at least one word differs, right?

... didn't occur to you that `True if delta > min_delta else False` is *exactly the same* as simply `delta > min_delta`? — Bakuriu, Aug 22 '16 at 09:10
Good point! When I'm writing the code this is just a way to add more clearness for myself. I usually refactor the code after to remove unneeded verbosity and to inline some operations. This time I forgot to do this on an example. — , Aug 22 '16 at 10:54

score 15 · Accepted Answer · edited Aug 22 '16 at 09:27

SequenceMatcher can accept lists of str as input.

You can first split the input into words, and then use SequenceMatcher to help you diff words. Then your colored diff would be by words instead of by characters.

>>> def my_get_opcodes(a, b):
...     s = SequenceMatcher(None, a, b)
...     for tag, i1, i2, j1, j2 in s.get_opcodes():
...         print('{:7}   a[{}:{}] --> b[{}:{}] {!r:>8} --> {!r}'.format(
...             tag, i1, i2, j1, j2, a[i1:i2], b[j1:j2]))
... 

>>> my_get_opcodes("qabxcd", "abycdf")
delete    a[0:1] --> b[0:0]      'q' --> ''
equal     a[1:3] --> b[0:2]     'ab' --> 'ab'
replace   a[3:4] --> b[2:3]      'x' --> 'y'
equal     a[4:6] --> b[3:5]     'cd' --> 'cd'
insert    a[6:6] --> b[5:6]       '' --> 'f'

# This is the bad result you currently have.
>>> my_get_opcodes("one two three\n", "ore tree emu\n")
equal     a[0:1] --> b[0:1]      'o' --> 'o'
replace   a[1:2] --> b[1:2]      'n' --> 'r'
equal     a[2:5] --> b[2:5]    'e t' --> 'e t'
delete    a[5:10] --> b[5:5]  'wo th' --> ''
equal     a[10:13] --> b[5:8]    'ree' --> 'ree'
insert    a[13:13] --> b[8:12]       '' --> ' emu'
equal     a[13:14] --> b[12:13]     '\n' --> '\n'

>>> my_get_opcodes("one two three\n".split(), "ore tree emu\n".split())
replace   a[0:3] --> b[0:3] ['one', 'two', 'three'] --> ['ore', 'tree', 'emu']

# This may be the result you want.
>>> my_get_opcodes("one two emily three ha\n".split(), "ore tree emily emu haha\n".split())
replace   a[0:2] --> b[0:2] ['one', 'two'] --> ['ore', 'tree']
equal     a[2:3] --> b[2:3] ['emily'] --> ['emily']
replace   a[3:5] --> b[3:5] ['three', 'ha'] --> ['emu', 'haha']

# A more complicated example exhibiting all four kinds of opcodes.
>>> my_get_opcodes("one two emily three yo right end\n".split(), "ore tree emily emu haha yo yes right\n".split())
replace   a[0:2] --> b[0:2] ['one', 'two'] --> ['ore', 'tree']
equal     a[2:3] --> b[2:3] ['emily'] --> ['emily']
replace   a[3:4] --> b[3:5] ['three'] --> ['emu', 'haha']
equal     a[4:5] --> b[5:6]   ['yo'] --> ['yo']
insert    a[5:5] --> b[6:7]       [] --> ['yes']
equal     a[5:6] --> b[7:8] ['right'] --> ['right']
delete    a[6:7] --> b[8:8]  ['end'] --> []
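
To connect the opcodes above back to the highlighting in the question, here is a minimal sketch of a word-level renderer. The function names `words_differ` and `highlight_words` and the CSS class names are made up for illustration; they are not part of difflib. It assumes that whitespace differences can be ignored, since `split()` normalizes all spacing, so the original layout is not preserved.

```python
from difflib import SequenceMatcher
from html import escape


def words_differ(a, b):
    """Return True if at least one word-level opcode is not 'equal'."""
    matcher = SequenceMatcher(None, a.split(), b.split(), autojunk=False)
    return any(tag != "equal" for tag, *_ in matcher.get_opcodes())


def highlight_words(a, b):
    """Return (old_html, new_html) with changed words wrapped in <span>s.

    The class names (diff-delete, diff-insert, diff-replace) are
    hypothetical; style them however you like in CSS.
    """
    a_words, b_words = a.split(), b.split()
    matcher = SequenceMatcher(None, a_words, b_words, autojunk=False)
    old_parts, new_parts = [], []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        old_chunk = escape(" ".join(a_words[i1:i2]))
        new_chunk = escape(" ".join(b_words[j1:j2]))
        if tag == "equal":
            old_parts.append(old_chunk)
            new_parts.append(new_chunk)
            continue
        # "delete" only has words on the old side, "insert" only on the
        # new side, and "replace" has words on both.
        if old_chunk:
            old_parts.append('<span class="diff-{}">{}</span>'.format(tag, old_chunk))
        if new_chunk:
            new_parts.append('<span class="diff-{}">{}</span>'.format(tag, new_chunk))
    return " ".join(old_parts), " ".join(new_parts)
```

With word-level matching, a check like `words_differ` can stand in for the min_delta bookkeeping, provided that "at least one word changed" is the definition of "different" you want.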

You can also diff by line, by book, or by segments. You only need to prepare a function that can preprocess the whole passage string into desired diff chunks.

For example:

To diff by line: you could probably use splitlines().
To diff by book: you could probably implement a function that strips off the leading 1., 2., and so on.
To diff by segments: you could pass in lists like ([book_1, author_1, year_1, book_2, author_2, ...], [book_1, author_1, year_1, book_2, author_2, ...]), and your coloring would then be by segment.
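
The preprocessing options listed here could be sketched as small "chunker" functions that feed the same SequenceMatcher call. All helper names below are hypothetical, and `by_books` assumes one book per line with an optional "1. " style prefix, which may not match your actual data.

```python
import re
from difflib import SequenceMatcher


def by_lines(text):
    """One chunk per line."""
    return text.splitlines()


def by_words_keep_space(text):
    """Words and whitespace runs as separate tokens.

    The capturing group makes re.split keep the separators, so
    "".join(tokens) reproduces the original text.
    """
    return [tok for tok in re.split(r"(\s+)", text) if tok]


def by_books(text):
    """One chunk per non-empty line, with a leading '1. ', '2. ' removed."""
    return [re.sub(r"^\s*\d+\.\s*", "", line)
            for line in text.splitlines() if line.strip()]


def diff_chunks(a, b, chunker):
    """Diff two strings at whatever granularity the chunker produces."""
    a_chunks, b_chunks = chunker(a), chunker(b)
    matcher = SequenceMatcher(None, a_chunks, b_chunks, autojunk=False)
    return [(tag, a_chunks[i1:i2], b_chunks[j1:j2])
            for tag, i1, i2, j1, j2 in matcher.get_opcodes()]
```

Stripping the numbering means a book that only changed position is not also flagged for its new number. It will still show up as a delete plus an insert, though, because SequenceMatcher does not track moves.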

Please do not abuse inline code formatting to highlight things that aren't either code (e.g. class/module/function names, small code samples etc.) or to be taken literally as written (e.g. a unique identifier that should be written exactly as displayed). To emphasize other kinds of content use either *italics* (\*italics\*) or **bold** (\*\*bold\*\*). — Bakuriu, Aug 22 '16 at 09:30
