Perl: Return Highest Percent Match for Strings

Question

I have a DNA sequence, like ATCGATCG for example. I also have a database of DNA sequences formatted as follows:

>Name of sequence1
SEQUENCEONEEXAMPLEGATCGATC
>Name of sequence2
SEQUENCETWOEXAMPLEGATCGATC

So the odd-numbered lines contain a name, and the even-numbered lines contain a sequence. Currently, I search for perfect matches between my sequence and the sequences in the database as follows (assume all the variables are declared):

my $name;
my $seq;
my $returnval = "The sequence does not match any in database";
open (my $database, "<", $db1) or die "Can't open $db1: $!";
until (eof $database){
    chomp ($name = <$database>);
    chomp ($seq = <$database>);
    if (
        index($seq, $entry) != -1
        || index($entry, $seq) != -1
    ) {
        $returnval = "The sequence matches: ". $name;
        last;
    }
}
close $database;

Is there any way for me to return the name of the sequence with the highest percentage match, as well as the percent match between the entry and that sequence in the database?

Not sure if [`String::Approx`](https://metacpan.org/pod/String::Approx) would help you here. – Zaid, Aug 16 '16 at 17:59
You can break up your string and compare it char by char, though that gets fiddly. Something similar has been done in [`this post`](http://stackoverflow.com/questions/9106978/perl-partial-match), for example. Better, find a module -- for example, [`Text::Fuzzy`](http://search.cpan.org/~bkb/Text-Fuzzy-0.24/lib/Text/Fuzzy.pod) should do it. – zdim, Aug 16 '16 at 18:15
You could look for the minimum Levenshtein edit distance (and convert that into a percentage). – Michael Carman, Aug 16 '16 at 19:26
I don't think String::Approx would help here, @Zaid: the sequences aren't all the same length. Also, the largest database has 1,881 entries, so 3,762 lines. I'll try using Text::Fuzzy and Text::Levenshtein to find a percent match. – Aditya J., Aug 16 '16 at 20:09
I do recommend a good module -- but, just btw, it shouldn't be too hard to roll your own. – zdim, Aug 17 '16 at 00:20
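
For anyone who wants to try the edit-distance idea from the comments without installing a module, here is a minimal pure-Perl sketch. The sub names `levenshtein` and `percent_match` are made up for this illustration, and dividing the distance by the length of the longer string is only one possible way to turn it into a percentage. Because it compares whole strings, a short entry scored against a much longer database sequence will get a low percentage even if the entry occurs inside it.

```perl
use strict;
use warnings;
use List::Util qw(min max);

# Levenshtein edit distance between two strings, computed one row at a time.
sub levenshtein {
    my ($s, $t) = @_;
    my @s = split //, $s;
    my @t = split //, $t;
    my @prev = (0 .. scalar @t);           # distance from "" to each prefix of $t
    for my $i (1 .. scalar @s) {
        my @curr = ($i);                   # distance from a prefix of $s to ""
        for my $j (1 .. scalar @t) {
            my $cost = $s[$i - 1] eq $t[$j - 1] ? 0 : 1;
            $curr[$j] = min(
                $prev[$j] + 1,             # deletion
                $curr[$j - 1] + 1,         # insertion
                $prev[$j - 1] + $cost,     # substitution (or match)
            );
        }
        @prev = @curr;
    }
    return $prev[-1];
}

# Convert the distance into a percentage.
# Assumption: normalise by the length of the longer string.
sub percent_match {
    my ($s, $t) = @_;
    my $longest = max(length($s), length($t));
    return 100 if $longest == 0;           # two empty strings are identical
    return 100 * (1 - levenshtein($s, $t) / $longest);
}

printf "%.1f%%\n", percent_match("ATCG", "ATGG");   # one substitution in four characters
```

This runs in time proportional to the product of the two lengths, which should be acceptable for a couple of thousand short sequences but will be slower than a compiled module.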

Aditya J. · Accepted Answer · 2016-08-17T14:04:03.193

String::Similarity returns the similarity between strings as a value between 0 and 1, 0 being completely dissimilar and 1 being exactly the same.

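use String::Similarity;    # exports similarity()

# $db1 holds the path to the database file, as in the question
my $entry = "AGGUUG";
my $returnval = "The sequence does not match any in database";
my $name;
my $seq;
my $currsim;
my $highestsim = 0;
my $highestname;
open (my $database, "<", $db1) or die "Can't open $db1: $!";
until (eof $database){
    chomp ($name = <$database>);
    chomp ($seq = <$database>);
    # The third argument is a lower limit: similarity stops comparing
    # once the score drops below the best one found so far.
    $currsim = similarity $entry, $seq, $highestsim;
    if ($currsim > $highestsim) {
        $highestsim = $currsim;
        $highestname = $name;
    }
}
close $database;
if (defined $highestname) {
    $highestname =~ s/^>//;    # strip the leading ">" from the name line
    $highestsim = $highestsim * 100;
    $returnval = "This sequence matches " . $highestname . " the best with " . $highestsim . "% similarity";
}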
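
If you want to reuse the loop, one option is to wrap it in a sub that takes the scoring function as an argument. The following is only a sketch: `best_match` and `position_identity` are hypothetical names, and `position_identity` is a crude stand-in scorer used so the example runs with core Perl alone. With String::Similarity installed you would presumably pass something like `sub { similarity($_[0], $_[1]) }` instead.

```perl
use strict;
use warnings;

# Hypothetical helper: read name/sequence line pairs from an open filehandle
# and return (name, percent) for the best-scoring sequence.
# $score is any code ref returning a number between 0 and 1.
sub best_match {
    my ($fh, $entry, $score) = @_;
    my ($best_name, $best_sim) = (undef, 0);
    while (defined(my $name = <$fh>)) {
        my $seq = <$fh>;
        last unless defined $seq;          # trailing name line with no sequence
        chomp($name, $seq);
        my $sim = $score->($entry, $seq);
        if ($sim > $best_sim) {
            ($best_name, $best_sim) = ($name, $sim);
        }
    }
    $best_name =~ s/^>// if defined $best_name;   # drop the leading ">"
    return ($best_name, 100 * $best_sim);
}

# Stand-in scorer (an assumption, for illustration only): the fraction of
# positions at which the two strings carry the same character.
sub position_identity {
    my ($s, $t) = @_;
    my ($ls, $lt) = (length($s), length($t));
    my $longest  = $ls > $lt ? $ls : $lt;
    my $shortest = $ls < $lt ? $ls : $lt;
    return 1 if $longest == 0;
    my $same = grep { substr($s, $_, 1) eq substr($t, $_, 1) } 0 .. $shortest - 1;
    return $same / $longest;
}

# An in-memory filehandle stands in for the database file.
my $db = ">Name of sequence1\nATCGATCC\n>Name of sequence2\nTTTTTTTT\n";
open(my $fh, '<', \$db) or die "Can't open in-memory database: $!";
my ($name, $percent) = best_match($fh, 'ATCGATCG', \&position_identity);
close $fh;
printf "Best match: %s (%.1f%%)\n", $name, $percent;   # prints "Best match: Name of sequence1 (87.5%)"
```

The sub returns an undefined name when the database is empty or nothing scores above 0, so the caller can fall back to a "no match" message.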

edited Aug 17 '16 at 14:04

answered Aug 16 '16 at 20:38

You should see a performance improvement if you pass `$highestsim` as a third argument to `similarity` -- it causes it to stop comparing once the similarity drops below the given limit. – Michael Carman Aug 17 '16 at 12:54
Makes sense. I'll add it – Aditya J. Aug 17 '16 at 14:03
