1

I'm trying to find the longest common substring (LCS) across multiple sequences.

There are a number of modules on CPAN that implement the LCS algorithm for 2 sequences, such as Algorithm::Diff and String::LCSS_XS, but I'm having a hard time in extending them to work with more than 2 sequences because the LCS across multiple sequences is not necessarily a LCS between any two of them.

Of note, despite its name, the Algorith::MLCS doesn't actually return the LCS but all the common elements (also non-consecutive) of a number of arrays. My impression is that it is broken by design, but I might be wrong.

Algorithm::Diff and Algorith::MLCS solve the longest common subsequence problem, not the longest common substring one.

Is there an obvious way to extend the n=2 algorithms or do I have to implement my version? If yes, how?

Thanks.

enricoferrero
  • 2,249
  • 1
  • 23
  • 28
  • 2
    I'm no expert, but I suspect that Algorithm::MLCS is not "broken by design". Rather, it's solving a different problem. It deals with the longest common **subsequence** problem. If I understand your question, you're dealing with the longest common **substring** problem. I think Algorithm::Diff is the same: it deals with the subsequence problem. – FMc Dec 14 '13 at 16:44
  • Yes, you're correct, I didn't get there was a difference between substring and subsequence! – enricoferrero Dec 14 '13 at 17:01

1 Answers1

1

This can be solved pretty easily using the Tree::Suffix module.

Example:

#!/usr/bin/env perl
use Modern::Perl;
use Bio::SeqIO;
use Tree::Suffix;

my $seqio = Bio::SeqIO->new(
    -file => "fasta_sequences.txt",
    -format => "Fasta");

my @seqs;

while (my $seqobj = $seqio->next_seq) {
    push @seqs, $seqobj->seq;
}

my $tree = Tree::Suffix->new(@seqs);
my @lcss = $tree->lcs;

say $_ for @lcss;
enricoferrero
  • 2,249
  • 1
  • 23
  • 28