I want to use Perl to find fuzzy matches in a file of sequences and, for each sequence, return the 0-based position(s) in the string at which the query matches with at most a given number of substitutions (let's say S=2). For example, if my input file is:

Name1
ACTGTGACCTTT
Name2
ACCTTTACTGTG
Name3
GACCTTTCTGTG
Name4
GCACCTTTTGTG
Name5
GCTACCTTTGTG
Name6
ACTGACCTTTTG
Name7
ACTGTACCTTTG
Name8
ACCTTTACCTTT
Name9
ACTGTGACTGTG

and my search query is "ACCTTT".

Then I want my output to be something like:

Name1
6
Name2
0
Name3
1
Name4
2
Name5
3
Name6
4
Name7
5
Name8
0    6

I've tried doing this with String::Approx, but its aindex function only returns the first index for each element of the array I am matching the query against. The module also seems to be buggy: even when I set the number of insertions and deletions to 0 and allow 2 substitutions, it still returns indexes for matches that need many more than 2 substitutions.

Here is the code I was using (in case there is something I don't understand about this module).

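#!/usr/bin/perl
use strict;
use warnings;

use String::Approx 'aindex';

my $input_fasta = $ARGV[0];
open(my $in, '<', $input_fasta) or die "Error opening $input_fasta: $!";

# Even-numbered lines (0, 2, ...) are names; odd-numbered lines are sequences.
my $l = 0;
my (@names, @seqs);
while (my $line = <$in>) {
    chomp $line;    # drop the trailing newline so it is not part of the sequence
    if ($l % 2 == 0) {
        push @names, $line;
    }
    else {
        push @seqs, $line;
    }
    $l++;
}
close $in;

# Intended to allow 0 insertions, 0 deletions and up to 2 substitutions.
my @hits = aindex("ACCTTT", ["I0", "D0", "S2"], @seqs);

# Print the name and index for every sequence that was not reported as -1.
for my $i (0 .. $#hits) {
    next if $hits[$i] == -1;
    print "$names[$i]\n$hits[$i]\n";
}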

But this just returns:

Name1
6
Name2
0
Name3
1
Name4
1
Name5
1
Name6
0
Name7
5
Name8
0
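
For reference, the behavior described above (every matching offset, substitutions only, no insertions or deletions) can be sketched in plain Perl by sliding a window of the query's length along each sequence and counting mismatched characters. This is only a minimal sketch under those assumptions, not a replacement for String::Approx; the sub name fuzzy_positions and the inline sample records are hypothetical and chosen for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of String::Approx): returns every 0-based
# offset where $query matches $seq with at most $max_subs substitutions.
# Insertions and deletions are not considered.
sub fuzzy_positions {
    my ($query, $seq, $max_subs) = @_;
    my $qlen = length $query;
    my @hits;
    for my $pos (0 .. length($seq) - $qlen) {
        my $mismatches = 0;
        for my $i (0 .. $qlen - 1) {
            next if substr($query, $i, 1) eq substr($seq, $pos + $i, 1);
            # Stop comparing this window once the budget is exceeded.
            last if ++$mismatches > $max_subs;
        }
        push @hits, $pos if $mismatches <= $max_subs;
    }
    return @hits;
}

# Sample records taken from the input above: name => sequence.
my @records = (
    [ Name1 => 'ACTGTGACCTTT' ],
    [ Name8 => 'ACCTTTACCTTT' ],
    [ Name9 => 'ACTGTGACTGTG' ],
);

for my $rec (@records) {
    my ($name, $seq) = @$rec;
    my @hits = fuzzy_positions('ACCTTT', $seq, 2);
    next unless @hits;    # skip sequences with no match
    print "$name\n", join("    ", @hits), "\n";
}
```

With these three records and S=2, the loop prints 6 for Name1 and 0    6 for Name8, and skips Name9, whose best window still needs 3 substitutions. The cost is proportional to sequence length times query length, which should be fine for short queries like this one.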
zx8754
Matthew Snyder