Longest common substring across multiple sequences

Question

I'm trying to find the longest common substring (LCS) across multiple sequences.

There are a number of modules on CPAN that implement the LCS algorithm for 2 sequences, such as ~~Algorithm::Diff and~~ String::LCSS_XS, but I'm having a hard time in extending them to work with more than 2 sequences because the LCS across multiple sequences is not necessarily a LCS between any two of them.

Of note, despite its name, the Algorith::MLCS doesn't actually return the LCS but all the common elements (also non-consecutive) of a number of arrays. My impression is that it is broken by design, but I might be wrong.

Algorithm::Diff and Algorith::MLCS solve the longest common subsequence problem, not the longest common substring one.

Is there an obvious way to extend the n=2 algorithms or do I have to implement my version? If yes, how?

Thanks.

I'm no expert, but I suspect that Algorithm::MLCS is not "broken by design". Rather, it's solving a different problem. It deals with the longest common **subsequence** problem. If I understand your question, you're dealing with the longest common **substring** problem. I think Algorithm::Diff is the same: it deals with the subsequence problem. — FMc, Dec 14 '13 at 16:44
Yes, you're correct, I didn't get there was a difference between substring and subsequence! — enricoferrero, Dec 14 '13 at 17:01

enricoferrero · Accepted Answer · 2013-12-15T10:30:14.677

1

This can be solved pretty easily using the Tree::Suffix module.

Example:

#!/usr/bin/env perl
use Modern::Perl;
use Bio::SeqIO;
use Tree::Suffix;

my $seqio = Bio::SeqIO->new(
    -file => "fasta_sequences.txt",
    -format => "Fasta");

my @seqs;

while (my $seqobj = $seqio->next_seq) {
    push @seqs, $seqobj->seq;
}

my $tree = Tree::Suffix->new(@seqs);
my @lcss = $tree->lcs;

say $_ for @lcss;

edited Dec 15 '13 at 10:30

answered Dec 14 '13 at 20:35

enricoferrero

2,249
1
23
28

Longest common substring across multiple sequences

1 Answers1