Perl module or code for finding overlapping region of two strings

Question

I have two strings.

They are not substrings of each other, but there is an overlapping region between them.

my $str1 = "AAAAAAAAAABBBBBBBBCC";
my $str2 = "BBBBBBBBCCZZZZZZZZZZ";

I want to find this overlapping region.

 "AAAAAAAAAABBBBBBBBCC"
           "BBBBBBBBCCZZZZZZZZZZ"

Overlap is "BBBBBBBBCC"

I searched CPAN and Google extensively.

There are many modules for the "edit distance" method, such as Algorithm::Diff, Text::Levenshtein, Text::OverlapFinder, and String::Similarity, but they are not what I am looking for.

The strings must not be gapped (no inserted or deleted characters) and no characters may be substituted. It's similar to sequence alignment in bioinformatics, but with no gap "open" or "extension" allowed except at the two extremes.

I was wondering if anyone found a solution or a work around yet.

Why don’t you just want to use something like `"$a $b" =~ /(\S+) \1/` here? — tchrist, Jun 08 '14 at 19:03
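The one-liner in the comment above can be fleshed out as a minimal sketch. The helper name `overlap_regex` is made up for illustration, and joining on a NUL byte instead of a space is an assumption (it presumes the data never contains NUL) so that the strings may contain whitespace. Because the regex engine tries the leftmost starting position first, the first match it finds is the longest suffix of the left string that is also a prefix of the right one.

```perl
use strict;
use warnings;

# Hypothetical helper (name not from the thread): longest suffix of $left
# that is also a prefix of $right, found with a backreference.
sub overlap_regex {
    my ($left, $right) = @_;

    # Assumption: a NUL byte never occurs in the data, so it is a safe
    # separator. /s lets "." match newlines as well.
    return "$left\x00$right" =~ /(.+)\x00\g{1}/s ? $1 : '';
}

print "Overlap is ",
    overlap_regex("AAAAAAAAAABBBBBBBBCC", "BBBBBBBBCCZZZZZZZZZZ"), "\n";
```

For the two strings from the question this prints `Overlap is BBBBBBBBCC`; when no suffix of the left string matches a prefix of the right one, the helper returns an empty string.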

score 4 · Answer 1 · answered Jun 06 '14 at 14:08

Check out the String::LCSS_XS module:

use String::LCSS_XS 'lcss';

my ($s1,$s2) = qw(
  AAAAAAAAAABBBBBBBBBB
  BBBBBBBBBBCCCCCCCCCC
);
my $longest = lcss ($s1, $s2);
print "$longest\n";

Output:

BBBBBBBBBB

mpapec

Thanks. It works well, but there is a problem. When I put mismatches at the string extremes, e.g. AAAAAAAAAABBBBBBDDBBBB, "lcss" still returns BBBBBB. I know this is the "longest common substring", but for my purpose this is wrong. I need a "not defined" result in this case, because the strings can't accept any gaps during the match. Do you have any idea why I am getting this? – Morteza Jun 06 '14 at 14:44
@Morteza so `BBBBBB` is shorter than `BBBBBBBBBB` (it is not the same output). – mpapec Jun 06 '14 at 14:46
Yes, but I need to return output when end of one string and end of another string are identical. My desired result is BBBB, not BBBBBB. I don't want the longest common substring; I want the longest common substring at the terminals of the strings, if it exists. E.g. if $string1=AAAAAAAAAABBBBBBDD; then $longest=""; – Morteza Jun 06 '14 at 15:13
@Morteza if this is bioperl, there are quite a few dedicated modules http://search.cpan.org/search?query=bio&mode=all – mpapec Jun 06 '14 at 16:49
No, I'm familiar with the BioPerl modules, but they don't have what I am looking for. – Morteza Jun 07 '14 at 03:46
@Morteza: *" I need to return output when end of one string and end of another string are identical"* but the example in your question shows strings `AAAAAAAAAABBBBBBBBCC` and `BBBBBBBBCCZZZZZZZZZZ` having an overlap of `BBBBBBBBCC` but the ends of those two strings are not identical. What do you really want? – Borodin Jun 08 '14 at 23:04
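The difference between a longest common substring and the result the asker describes can be shown with a short sketch. It reads the comments above as "the longest suffix of the first string that is also a prefix of the second". That reading is an interpretation, since the wording in the thread is ambiguous. The helper name `terminal_overlap` is hypothetical and only core Perl is used.

```perl
use strict;
use warnings;

# Hypothetical helper: return the longest suffix of $left that equals a
# prefix of $right, or '' when the ends do not match exactly.
sub terminal_overlap {
    my ($left, $right) = @_;

    # The overlap cannot be longer than the shorter string.
    my $max = length($left) < length($right) ? length($left) : length($right);

    # Try the longest candidate first, so the first hit is the answer.
    for (my $len = $max; $len > 0; $len--) {
        return substr($right, 0, $len)
            if substr($left, -$len) eq substr($right, 0, $len);
    }
    return '';
}

# The three left-hand strings discussed in the comments.
for my $left (qw(AAAAAAAAAABBBBBBBBBB AAAAAAAAAABBBBBBDDBBBB AAAAAAAAAABBBBBBDD)) {
    printf "%-24s => '%s'\n", $left, terminal_overlap($left, 'BBBBBBBBBBCCCCCCCCCC');
}
```

Against `BBBBBBBBBBCCCCCCCCCC`, the three strings give `BBBBBBBBBB`, `BBBB`, and an empty string respectively, which matches the results requested in the comments.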

Miller · Accepted Answer · 2014-06-06T21:16:50.563

1

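Because you're searching for an overlap bounded by the ends of the strings, this is a simple enough problem that brute force is the way to go. Equalize the string lengths (keep the tail of the first string and the head of the second), then chop one character off the front of the first and the back of the second until the two match.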

There are some potential avenues to make this more efficient, but only explore those IF this becomes too slow.

use strict;
use warnings;

sub overlap {
    my ($str1, $str2) = @_;

    # Equalize Lengths
    if (length $str1 < length $str2) {
        $str2 = substr $str2, 0, length($str1);
    } elsif (length $str1 > length $str2) {
        $str1 = substr $str1, length($str1) - length($str2);
    }

    # Reduce until match found
    while ($str1 ne $str2) {
        substr $str1, 0, 1, '';
        chop $str2;
    }

    return $str1;
}

while (<DATA>) {
    print "Overlap is " . overlap(split), "\n";
}

__DATA__
AAAAAAAAAABBBBBBBBBB  BBBBBBBBBBCCCCCCCCCC
aln.trp.leu.tre       leu.tre.met.ile
aaaaaaaaaaaaaaaaaaaZ  aaaaaaaaaaaaaaa

Outputs:

Overlap is BBBBBBBBBB
Overlap is leu.tre
Overlap is

edited Jun 06 '14 at 21:16

answered Jun 06 '14 at 16:27

Miller

Thanks. But there is another problem. The strings are more complicated than my example. If you have a peptide sequence, for example "aln.trp.leu.tre" and "leu.tre.met.ile", this does not work. – Morteza Jun 06 '14 at 16:40
Created a new solution that just does brute force to find overlaps. – Miller Jun 06 '14 at 19:07
Dear Miller, can you update your code to consider both directions of the strings? The current solution only works for overlaps on one side. For example, for AAAAAAzzzBBBB and BBBBzzzAAAAAA it returns BBBB, not AAAAAA. I think it's possible by switching the two strings and selecting the longest result. Thanks again. – Morteza Jun 07 '14 at 07:28
If this was a paid relationship, of course. But considering this is just volunteering, that's a trivial enough enhancement that you should just do it yourself. Simply create a new function that calls this one twice, and compare the returned results. You'll have to take into account what you want the result to be if the lengths are equal, but those are design features that I shouldn't be deciding anyway. Cheers :) – Miller Jun 07 '14 at 17:20
That’s a lot of work, but perhaps I misunderstand the question. – tchrist Jun 08 '14 at 19:03
@tchrist What's a lot of work? I don't understand the comment. *Btw, Hi. Appreciate your books* – Miller Jun 08 '14 at 21:16
What’s a lot of work is using substr instead of pattern matching, but I think I don’t quite grok the question. And thanks. – tchrist Jun 08 '14 at 21:25
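The enhancement discussed in the comments (call the function twice, once per string order, and compare the results) might look like the sketch below. The wrapper name `overlap_either` is hypothetical, and the tie-breaking rule (the first order wins when both overlaps have the same length) is an assumption, since the thread leaves that design choice open.

```perl
use strict;
use warnings;

# Same brute-force routine as in the answer above.
sub overlap {
    my ($str1, $str2) = @_;

    # Equalize lengths: tail of the first string, head of the second.
    if (length $str1 < length $str2) {
        $str2 = substr $str2, 0, length($str1);
    } elsif (length $str1 > length $str2) {
        $str1 = substr $str1, length($str1) - length($str2);
    }

    # Reduce until a match is found (possibly the empty string).
    while ($str1 ne $str2) {
        substr $str1, 0, 1, '';
        chop $str2;
    }
    return $str1;
}

# Hypothetical wrapper: try both orders and keep the longer result.
# Assumption: on a tie, the first order ($str1 before $str2) wins.
sub overlap_either {
    my ($str1, $str2) = @_;
    my $forward  = overlap($str1, $str2);
    my $backward = overlap($str2, $str1);
    return length($backward) > length($forward) ? $backward : $forward;
}

print "Overlap is ", overlap_either('AAAAAAzzzBBBB', 'BBBBzzzAAAAAA'), "\n";
```

For the pair from the comment, the forward call yields `BBBB` and the reversed call yields `AAAAAA`, so the wrapper prints `Overlap is AAAAAA`.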
