
I've been doing practice problems in both Perl and Python (I'm choosing between the two), and one of them asks me to write my own diff algorithm, like the one GitHub uses. I know that the Longest Common Subsequence (LCS) problem is a big part of the solution. I used the Wikipedia page for LCS as a reference, but I'm still having trouble figuring out the diff part.

I also realize there are already modules on CPAN like Algorithm::Diff, but this is mostly just for practice, and using them feels like cheating.

I figured out the Python/pseudocode version of the algorithm, but it relies on multi-dimensional arrays, which Perl doesn't seem to have.

Now I'm up to where I can successfully get the Longest Common Subsequence length in Perl.

Basically the pseudocode (almost python-like, but is supposed to be for Perl) I can think of is something like this:

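def lengthOfLCS(string1, string2):
    # An empty string has nothing in common with anything
    if len(string1) == 0 or len(string2) == 0:
        return 0
    # Matching first characters are part of the LCS
    elif string1[0] == string2[0]:
        return 1 + lengthOfLCS(string1[1:], string2[1:])
    # Otherwise drop the first character of either string and keep the better result
    return max(lengthOfLCS(string1, string2[1:]), lengthOfLCS(string1[1:], string2))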

I haven't implemented it yet, but I think that's basically how I can calculate the length of LCS of two strings?

Output-wise, it should return 4 for "HUMAN" and "CHIMPANZEE" (LCS = HMAN).

So what I'm asking is: how do I get from this point to printing diffs in Perl? I'm aware that the LCS function should return a list/array instead of only the length of the LCS. That seems doable by returning a multi-dimensional list from the LCS function and then processing it in a separate diff function.

I'm kinda new to Perl, so any pointers/tips would be greatly appreciated. Thanks.

Anthony Wijaya
  • This is [answered](http://en.wikipedia.org/wiki/Longest_common_subsequence_problem) in detail on Wikipedia – ikegami Apr 09 '15 at 14:32
  • There is also `Algorithm::MLCS`, `Algorithm::Diff`, `Algorithm::LCS`, and `Algorithm::NeedlemanWunsch` on CPAN. – Sinan Ünür Apr 09 '15 at 14:57
  • Hi, thanks for the suggestions. @ikegami Actually, I did reference the wiki page; I forgot to mention it here. However, I still don't understand the diff part. The code explained on Wikipedia mostly uses multi-dimensional arrays, which Perl doesn't have? Can you suggest a Perl alternative? As for CPAN, I'm doing this just for practice, not for actual usage (purely educational), so I'd prefer not to use those modules. Thanks anyway, I'll probably look at their source code to try to figure them out. – Anthony Wijaya Apr 09 '15 at 15:34
  • It has arrays of (references to) arrays. Same thing. `$a[5][7]` – ikegami Apr 09 '15 at 15:39

1 Answer


You can use my reference implementation of LCS in Perl. It takes two array references as input and returns an array of two-element arrays containing the indices of the matching elements.

use LCS;
my $lcs = LCS->LCS( [qw(a b)], [qw(a b b)] );
# $lcs now contains an arrayref of matching positions
# same as
$lcs = [
  [ 0, 0 ],
  [ 1, 2 ]
];

LCS uses the traditional algorithm and reads out the LCS iteratively rather than recursively (see my blog post Loopify Recursions at wollmers-perl.blogspot.de). Most sample code uses recursion, which does not scale well in Perl. So if you want to learn from the code, look at the subs LCS() and _lcs().
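To make the "traditional algorithm, read out iteratively" idea concrete, here is a minimal sketch in Python (the question's pseudocode is Python-like). It is not the code of the LCS module; the function name `lcs_pairs` is made up for illustration. The nested list `table` plays the role of the two-dimensional array, which in Perl would be an array of array references accessed as `$table[$i][$j]`.

```python
def lcs_pairs(a, b):
    """Return index pairs (i, j) with a[i] == b[j] that form one LCS of a and b."""
    m, n = len(a), len(b)
    # table[i][j] holds the LCS length of the suffixes a[i:] and b[j:].
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m - 1, -1, -1):
        for j in range(n - 1, -1, -1):
            if a[i] == b[j]:
                table[i][j] = table[i + 1][j + 1] + 1
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])

    # Read the matches out with a plain loop instead of recursion.
    pairs = []
    i = j = 0
    while i < m and j < n:
        if a[i] == b[j]:
            pairs.append((i, j))
            i += 1
            j += 1
        elif table[i + 1][j] >= table[i][j + 1]:
            i += 1  # skipping a[i] loses nothing
        else:
            j += 1  # skipping b[j] loses nothing
    return pairs


pairs = lcs_pairs("HUMAN", "CHIMPANZEE")
print(len(pairs), "".join("HUMAN"[i] for i, _ in pairs))  # 4 HMAN
```

When several subsequences of the same maximal length exist, the pairs returned depend on how ties are broken, so another implementation may report different indices of equal length.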

If you want the diff, i.e. an edit script, you can reconstruct it from the LCS array.

The method lcs2align() does nearly this.

use Data::Dumper;
use LCS;
print Dumper(
  LCS->lcs2align(
    [qw(a   b)],
    [qw(a b b)],
    LCS->LCS([qw(a b)],[qw(a b b)])
  )
);
# prints
$VAR1 = [
          [
            'a',
            'a'
          ],
          [
            '',
            'b'
          ],
          [
            'b',
            'b'
          ]
];
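The step from index pairs to an alignment can be sketched in Python as follows. This is only an approximation of what lcs2align() produces for the example above, not its actual source: the name `pairs_to_alignment` is hypothetical, and putting every unmatched element on its own row next to an empty-string gap is an assumption.

```python
def pairs_to_alignment(a, b, pairs, gap=""):
    """Turn LCS index pairs into rows [left, right], using gap for missing elements."""
    rows = []
    i = j = 0
    for pi, pj in pairs:
        # Elements skipped before this match exist on one side only.
        rows.extend([x, gap] for x in a[i:pi])
        rows.extend([gap, y] for y in b[j:pj])
        rows.append([a[pi], b[pj]])
        i, j = pi + 1, pj + 1
    # Whatever follows the last match is unmatched as well.
    rows.extend([x, gap] for x in a[i:])
    rows.extend([gap, y] for y in b[j:])
    return rows


print(pairs_to_alignment(["a", "b"], ["a", "b", "b"], [[0, 0], [1, 2]]))
# [['a', 'a'], ['', 'b'], ['b', 'b']]
```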

A diff in the format of sdiff() (see Algorithm::Diff) would now look like:

[
  [ 'u', 'a', 'a'  ],
  [ '+', '',  'b'  ],
  [ 'u', 'b', 'b'  ],
]

How you get the edit script from the alignment should be trivial and is left as an exercise.
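For readers who want to check their solution, one possible sketch in Python is shown here. The function name `alignment_to_sdiff` is hypothetical. The flags follow the sdiff() convention of Algorithm::Diff: `u` (unchanged), `+` (only in the second sequence), `-` (only in the first), and `c` (changed). Using the empty string as the gap marker is an assumption and becomes ambiguous if the input itself can contain empty strings.

```python
def alignment_to_sdiff(rows, gap=""):
    """Label each aligned row [left, right] with an sdiff-style flag."""
    script = []
    for left, right in rows:
        if left == gap:
            op = "+"  # element exists only in the second sequence
        elif right == gap:
            op = "-"  # element exists only in the first sequence
        elif left == right:
            op = "u"  # unchanged
        else:
            op = "c"  # both present but different
        script.append([op, left, right])
    return script


alignment = [["a", "a"], ["", "b"], ["b", "b"]]
print(alignment_to_sdiff(alignment))
# [['u', 'a', 'a'], ['+', '', 'b'], ['u', 'b', 'b']]
```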

If you want faster implementations, you can use LCS::Tiny, the fastest pure-Perl implementation LCS::BV, or, fastest at larger scales, Algorithm::Diff::XS (see my blog post Tuning Algorithm::Diff at wollmers-perl.blogspot.de).

Please keep in mind that an edit script based on LCS does not automatically provide an SES (shortest edit script). LCS is based on edit operations that allow only insert and delete (simple edit distance). SES algorithms usually minimize Levenshtein distance (insert, delete, and mismatch).
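The difference between the two distances can be seen in a small Python sketch. The function names are made up for illustration, and both functions use the textbook dynamic-programming recurrences rather than any of the modules mentioned above.

```python
def lcs_length(a, b):
    """LCS length, keeping only one previous row of the table."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b):
            cur.append(prev[j] + 1 if x == y else max(prev[j + 1], cur[j]))
        prev = cur
    return prev[-1]


def indel_distance(a, b):
    """Edit distance when only insert and delete are allowed."""
    return len(a) + len(b) - 2 * lcs_length(a, b)


def levenshtein(a, b):
    """Edit distance with insert, delete, and substitution, each costing 1."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # delete x
                           cur[j - 1] + 1,           # insert y
                           prev[j - 1] + (x != y)))  # match or substitute
        prev = cur
    return prev[-1]


print(indel_distance("abc", "abd"), levenshtein("abc", "abd"))  # 2 1
```

For "abc" versus "abd", the LCS-based script needs a delete plus an insert, while a script that allows substitution needs a single edit.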