I'm writing a GitHub browser extension that displays some diffs in an alternate way. GitHub simply shows pre-computed diffs, so I need to re-diff things myself.

The UNIX diff utility is based on finding a Longest Common Subsequence (LCS). I found an implementation in javascript-algorithms. However, it only returns the LCS itself, not the indices at which the differences occur.

Taking the Wikipedia example, calling the above implementation with

longestCommonSubsequence('abcdfghjqz', 'abcdefgijkrxyz');

yields an array,

(8) ["a", "b", "c", "d", "f", "g", "j", "z"]

but what I need is something that allows me to figure out:

abcd fgh jq    z
abcdefg ij krxyz
    +  -+ -++++

I don't believe it's as simple as stated in the Wikipedia article...

From a longest common subsequence it is only a small step to get diff-like output: if an item is absent in the subsequence but present in the first original sequence, it must have been deleted (as indicated by the '-' marks, below). If it is absent in the subsequence but present in the second original sequence, it must have been inserted (as indicated by the '+' marks).

...because for more complex strings (e.g. code), there will be repeated elements that would require a lot of backtracking to figure out where the "real" differences begin and end.

I notice however that the DP implementation leaves a memoization table, lcsMatrix, which for the abcd... example leaves:

(image: the lcsMatrix table for the two example strings)

Can the final row and column be used to glean exactly where the differences are?

To generate the above table and output the result, simply add

  console.table(lcsMatrix);
  console.log(longestSequence);

at the end of the linked implementation.

If I figure it out, I'll post a self-answer. So far it's eluding me, though.

Andrew Cheong

1 Answer

Take a look at the following... https://github.com/jonTrent/PatienceDiff

Using your data as an example...

const diff = patienceDiff('abcdfghjqz'.split(''), 'abcdefgijkrxyz'.split(''));

...returns...

{lines: Array(16), lineCountDeleted: 2, lineCountInserted: 6, lineCountMoved: 0}

lineCountDeleted: 2
lineCountInserted: 6
lineCountMoved: 0
lines: Array(16)
0: {line: "a", aIndex: 0, bIndex: 0}
1: {line: "b", aIndex: 1, bIndex: 1}
2: {line: "c", aIndex: 2, bIndex: 2}
3: {line: "d", aIndex: 3, bIndex: 3}
4: {line: "e", aIndex: -1, bIndex: 4}
5: {line: "f", aIndex: 4, bIndex: 5}
6: {line: "g", aIndex: 5, bIndex: 6}
7: {line: "h", aIndex: 6, bIndex: -1}
8: {line: "i", aIndex: -1, bIndex: 7}
9: {line: "j", aIndex: 7, bIndex: 8}
10: {line: "q", aIndex: 8, bIndex: -1}
11: {line: "k", aIndex: -1, bIndex: 9}
12: {line: "r", aIndex: -1, bIndex: 10}
13: {line: "x", aIndex: -1, bIndex: 11}
14: {line: "y", aIndex: -1, bIndex: 12}
15: {line: "z", aIndex: 9, bIndex: 13}
length: 16

Note that the result refers to "lines" because the algorithm was built with GitHub-style diffs in mind, i.e. comparing line by line. Splitting the sample strings into arrays of one-character "lines" allows the algorithm to be used on character strings too.

Here, aIndex === -1 indicates that the character was added from the second string, and bIndex === -1 indicates that the character was deleted from the first string.
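
To get from that result to the aligned three-row layout asked for in the question, something like the following should work. This is only a sketch: renderAlignedDiff is a hypothetical helper, not part of PatienceDiff, and sampleLines is simply the lines array shown above, copied in by hand so the snippet runs on its own.

```javascript
// Hypothetical helper (not part of PatienceDiff): turns an array of
// {line, aIndex, bIndex} entries into three aligned strings.
function renderAlignedDiff(lines) {
  let first = '';
  let second = '';
  let marks = '';
  for (const { line, aIndex, bIndex } of lines) {
    first += aIndex === -1 ? ' ' : line;   // gap where the first string lacks the item
    second += bIndex === -1 ? ' ' : line;  // gap where the second string lacks the item
    marks += aIndex === -1 ? '+' : bIndex === -1 ? '-' : ' ';
  }
  return [first, second, marks];
}

// The `lines` array from the result above, copied verbatim.
const sampleLines = [
  { line: 'a', aIndex: 0, bIndex: 0 },
  { line: 'b', aIndex: 1, bIndex: 1 },
  { line: 'c', aIndex: 2, bIndex: 2 },
  { line: 'd', aIndex: 3, bIndex: 3 },
  { line: 'e', aIndex: -1, bIndex: 4 },
  { line: 'f', aIndex: 4, bIndex: 5 },
  { line: 'g', aIndex: 5, bIndex: 6 },
  { line: 'h', aIndex: 6, bIndex: -1 },
  { line: 'i', aIndex: -1, bIndex: 7 },
  { line: 'j', aIndex: 7, bIndex: 8 },
  { line: 'q', aIndex: 8, bIndex: -1 },
  { line: 'k', aIndex: -1, bIndex: 9 },
  { line: 'r', aIndex: -1, bIndex: 10 },
  { line: 'x', aIndex: -1, bIndex: 11 },
  { line: 'y', aIndex: -1, bIndex: 12 },
  { line: 'z', aIndex: 9, bIndex: 13 },
];

const [first, second, marks] = renderAlignedDiff(sampleLines);
console.log(first);   // abcd fgh jq    z
console.log(second);  // abcdefg ij krxyz
console.log(marks);   //     +  -+ -++++
```

Because each entry carries both indices, the same loop could just as easily collect the aIndex / bIndex positions of the changes instead of building strings.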

There's also a version included called patienceDiffPlus, which identifies likely movements of lines / characters. (See also "Find difference between two strings in JavaScript".)
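
For comparison, the DP table mentioned in the question can give the same kind of index information, but it takes a walk back through the whole table from the bottom-right cell rather than just the final row and column. Below is a minimal sketch of that backtrack, assuming a standard LCS length table. The names buildLcsTable and lcsDiff are hypothetical, and the table orientation (rows for the first sequence, columns for the second) is an assumption that may not match lcsMatrix in javascript-algorithms.

```javascript
// Hypothetical names; not taken from javascript-algorithms or PatienceDiff.

// table[i][j] = length of the LCS of a[0..i) and b[0..j).
function buildLcsTable(a, b) {
  const table = Array.from({ length: a.length + 1 }, () =>
    new Array(b.length + 1).fill(0));
  for (let i = 1; i <= a.length; i++) {
    for (let j = 1; j <= b.length; j++) {
      table[i][j] = a[i - 1] === b[j - 1]
        ? table[i - 1][j - 1] + 1
        : Math.max(table[i - 1][j], table[i][j - 1]);
    }
  }
  return table;
}

// Walks the table from the bottom-right corner back to the top-left,
// recording each item with its index in a and b (-1 when absent).
function lcsDiff(a, b) {
  const table = buildLcsTable(a, b);
  const ops = [];
  let i = a.length;
  let j = b.length;
  while (i > 0 || j > 0) {
    if (i > 0 && j > 0 && a[i - 1] === b[j - 1]) {
      // Item is common to both sequences.
      ops.push({ item: a[i - 1], aIndex: i - 1, bIndex: j - 1 });
      i--;
      j--;
    } else if (j > 0 && (i === 0 || table[i][j - 1] >= table[i - 1][j])) {
      // Item exists only in b: an insertion.
      ops.push({ item: b[j - 1], aIndex: -1, bIndex: j - 1 });
      j--;
    } else {
      // Item exists only in a: a deletion.
      ops.push({ item: a[i - 1], aIndex: i - 1, bIndex: -1 });
      i--;
    }
  }
  return ops.reverse(); // collected back to front
}

const ops = lcsDiff('abcdfghjqz'.split(''), 'abcdefgijkrxyz'.split(''));
console.log(ops.filter(op => op.aIndex === -1).length); // 6 insertions
console.log(ops.filter(op => op.bIndex === -1).length); // 2 deletions
```

The backtrack visits at most a.length + b.length cells, so the cost is dominated by building the table. When several alignments are equally long, the tie-breaking rule in the second branch decides which one is reported, so repeated elements can produce a valid but less readable diff than a patience-style approach.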

Trentium