I'm writing a GitHub browser extension that displays some diffs in an alternate way. GitHub simply shows pre-computed diffs, so I need to re-diff things myself.
The UNIX diff utility is based on finding a Longest Common Subsequence (LCS). I found an implementation in javascript-algorithms. However, it only returns the LCS itself, not the indices at which the differences occur.
Taking the Wikipedia example, calling the above implementation with
longestCommonSubsequence('abcdfghjqz', 'abcdefgijkrxyz');
yields an array,
(8) ["a", "b", "c", "d", "f", "g", "j", "z"]
but what I need is something that allows me to figure out:
abcd fgh j    z
abcdefg ijkrxyz
    +  -+ ++++
I don't believe it's as simple as stated in the Wikipedia article...
From a longest common subsequence it is only a small step to get diff-like output: if an item is absent in the subsequence but present in the first original sequence, it must have been deleted (as indicated by the '-' marks, below). If it is absent in the subsequence but present in the second original sequence, it must have been inserted (as indicated by the '+' marks).
...because for more complex strings (e.g. code), there will be repeated elements that would seem to require a lot of backtracking to figure out where the "real" differences begin and end.
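For reference, here is a minimal sketch of the "small step" the quote describes, as I understand it. The function name diffFromLcs and the { op, value } record shape are my own inventions, not part of javascript-algorithms. It assumes the lcs argument really is a common subsequence of both inputs, and it matches each LCS element to the leftmost position still available, which gives a valid edit script but not necessarily the alignment a real diff tool would choose.

```javascript
// Hypothetical helper, not part of javascript-algorithms.
// Walks both sequences in step with the LCS and labels every element
// as common (' '), deleted from a ('-'), or inserted from b ('+').
function diffFromLcs(a, b, lcs) {
  const ops = [];
  let i = 0; // position in a
  let j = 0; // position in b
  for (const common of lcs) {
    // Anything in a before the next common element was deleted.
    while (i < a.length && a[i] !== common) ops.push({ op: '-', value: a[i++] });
    // Anything in b before the next common element was inserted.
    while (j < b.length && b[j] !== common) ops.push({ op: '+', value: b[j++] });
    ops.push({ op: ' ', value: common });
    i++;
    j++;
  }
  // Leftovers after the last common element.
  while (i < a.length) ops.push({ op: '-', value: a[i++] });
  while (j < b.length) ops.push({ op: '+', value: b[j++] });
  return ops;
}

const ops = diffFromLcs(
  'abcdfghjqz',
  'abcdefgijkrxyz',
  ['a', 'b', 'c', 'd', 'f', 'g', 'j', 'z']
);
console.log(ops.map((o) => o.value).join('')); // → abcdefghijqkrxyz
console.log(ops.map((o) => o.op).join(''));    // one marker per character above
```

Because the inner loops only ever move forward, repeated elements don't force any backtracking here; what I can't tell from this alone is whether the leftmost match is the "right" one for readable code diffs.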
I notice, however, that the DP implementation leaves a memoization table, lcsMatrix, which for the abcd... example leaves:
Can the final row and column be used to glean exactly where the differences are?
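One direction I'm considering is the standard LCS traceback, which starts at the bottom-right cell and walks back to the top-left, reading interior cells along the way rather than only the final row and column. Below is a rough sketch under my own assumptions: buildLcsTable and diffFromTable are hypothetical names, and the table here is built with one row per element of the first sequence and one column per element of the second, which may not match the orientation of lcsMatrix in the linked implementation.

```javascript
// Hypothetical sketch; these names are illustrative, not from javascript-algorithms.
// table[i][j] holds the LCS length of the first i elements of a
// and the first j elements of b.
function buildLcsTable(a, b) {
  const table = Array.from({ length: a.length + 1 }, () =>
    new Array(b.length + 1).fill(0)
  );
  for (let i = 1; i <= a.length; i++) {
    for (let j = 1; j <= b.length; j++) {
      table[i][j] =
        a[i - 1] === b[j - 1]
          ? table[i - 1][j - 1] + 1
          : Math.max(table[i - 1][j], table[i][j - 1]);
    }
  }
  return table;
}

// Walks from the bottom-right cell to the top-left, recording one edit per step,
// together with the index in a and/or b that the edit refers to.
function diffFromTable(a, b, table) {
  const edits = [];
  let i = a.length;
  let j = b.length;
  while (i > 0 || j > 0) {
    if (i > 0 && j > 0 && a[i - 1] === b[j - 1]) {
      edits.push({ op: ' ', value: a[i - 1], aIndex: i - 1, bIndex: j - 1 });
      i--;
      j--;
    } else if (j > 0 && (i === 0 || table[i][j - 1] >= table[i - 1][j])) {
      edits.push({ op: '+', value: b[j - 1], bIndex: j - 1 });
      j--;
    } else {
      edits.push({ op: '-', value: a[i - 1], aIndex: i - 1 });
      i--;
    }
  }
  return edits.reverse(); // edits were collected back to front
}

const first = 'abcdfghjqz';
const second = 'abcdefgijkrxyz';
const edits = diffFromTable(first, second, buildLcsTable(first, second));
console.log(edits.filter((e) => e.op !== ' '));
```

The tie-break in the middle branch (preferring an insertion when both neighbours are equal) is an arbitrary choice on my part; flipping it should give a different but equally short edit script.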
To generate the above table and output the result, simply add
console.table(lcsMatrix);
console.log(longestSequence);
at the end of the linked implementation.
If I figure it out, I'll post a self-answer. So far it's eluding me, though.