I have a DNA sequence, for example ATCGATCG. I also have a database of DNA sequences formatted as follows:

>Name of sequence1
SEQUENCEONEEXAMPLEGATCGATC
>Name of sequence2
SEQUENCETWOEXAMPLEGATCGATC

(So the odd-numbered lines contain a name, and the even-numbered lines contain a sequence.) Currently, I search for perfect matches between my sequence and the sequences in the database as follows (assume all the variables are declared):

my $name;
my $seq;
my $returnval = "The sequence does not match any in database";
open (my $database, "<", $db1) or die "Can't find db1";
until (eof $database){
    chomp ($name = <$database>);
    chomp ($seq = <$database>);
    if (
        index($seq, $entry) != -1
        || index($entry, $seq) != -1
    ) {
        $returnval = "The sequence matches: ". $name;
        last;
    }
}
close $database;

Is there any way for me to return the name of the database sequence with the highest percentage match, as well as the percent match between the entry and that sequence?

randominstanceOfLivingThing
Aditya J.
  • How big is the database? – Zaid Aug 16 '16 at 17:55
  • Not sure if [`String::Approx`](https://metacpan.org/pod/String::Approx) would help you here. – Zaid Aug 16 '16 at 17:59
  • You can break up your string and compare it character by character, though that is finicky. Something similar has been done in [`this post`](http://stackoverflow.com/questions/9106978/perl-partial-match), for example. Better, find a module -- for example, [`Text::Fuzzy`](http://search.cpan.org/~bkb/Text-Fuzzy-0.24/lib/Text/Fuzzy.pod) should do it. – zdim Aug 16 '16 at 18:15
  • You could look for the minimum Levenshtein edit distance (and convert that into a percentage). – Michael Carman Aug 16 '16 at 19:26
  • I don't think String::Approx would help here, @Zaid: the sequences aren't all the same length. Also, the largest database has 1,881 entries, so 3,762 lines. I'll try using Text::Fuzzy and Text::Levenshtein to find a percent match. – Aditya J. Aug 16 '16 at 20:09
  • It appears String::Similarity returns a percentage. – Aditya J. Aug 16 '16 at 20:19
  • I do recommend a good module -- but, just btw, it shouldn't be too hard to roll your own. – zdim Aug 17 '16 at 00:20

1 Answer

String::Similarity returns the similarity between strings as a value between 0 and 1, 0 being completely dissimilar and 1 being exactly the same.

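use String::Similarity;    # exports similarity()

my $entry = "AGGUUG";
my $returnval = "The sequence does not match any in database";
my $name;
my $seq;
my $currsim;
my $highestsim = 0;
my $highestname;
# $db1 (the path to the database file) is assumed to be declared elsewhere.
open (my $database, "<", $db1) or die "Can't open $db1: $!";
until (eof $database){
    chomp ($name = <$database>);    # odd-numbered line: ">Name"
    chomp ($seq = <$database>);     # even-numbered line: the sequence
    # The third argument is a lower limit that lets similarity() stop early
    # on sequences that cannot beat the best score found so far.
    $currsim = similarity $entry, $seq, $highestsim;
    if ($currsim > $highestsim) {
        $highestsim = $currsim;
        $highestname = $name;
    }
}
close $database;
if (defined $highestname) {
    $highestsim = $highestsim * 100;         # convert the 0..1 score to a percentage
    my @names = split(/>/, $highestname);    # drop the leading ">"
    $returnval = "This sequence matches " . $names[1] . " the best with " . $highestsim . "% similarity";
}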
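
If installing a module is not an option, the Levenshtein approach suggested in the comments can be sketched in plain Perl. This is only a rough illustration, not what String::Similarity does internally: the helper names `levenshtein` and `percent_match` are made up for this example, and dividing by the length of the longer string is just one possible way to turn an edit distance into a percentage.

```perl
use strict;
use warnings;
use List::Util qw(min max);

# Hypothetical helper: dynamic-programming Levenshtein distance,
# keeping only the previous row of the table in memory.
sub levenshtein {
    my ($s, $t) = @_;
    my @s = split //, $s;
    my @t = split //, $t;
    my @prev = (0 .. scalar(@t));    # distances from the empty prefix of $s
    for my $i (1 .. scalar(@s)) {
        my @curr = ($i);
        for my $j (1 .. scalar(@t)) {
            my $cost = $s[$i - 1] eq $t[$j - 1] ? 0 : 1;
            $curr[$j] = min(
                $prev[$j] + 1,            # deletion
                $curr[$j - 1] + 1,        # insertion
                $prev[$j - 1] + $cost,    # substitution (or match)
            );
        }
        @prev = @curr;
    }
    return $prev[-1];
}

# Hypothetical helper: convert the distance into a 0-100 score.
# Normalizing by the longer length is an assumption of this sketch.
sub percent_match {
    my ($s, $t) = @_;
    my $longest = max(length $s, length $t);
    return 100 if $longest == 0;
    return 100 * (1 - levenshtein($s, $t) / $longest);
}

printf "%.1f%%\n", percent_match("ATCGATCG", "ATCGTTCG");    # prints 87.5%
```

Inside the loop above, `percent_match($entry, $seq)` could stand in for the `similarity` call (dropping the `* 100` step). Note that this compares whole strings, so a short entry scored against a much longer database sequence will get a low percentage even if it occurs inside that sequence exactly.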
Aditya J.