I am calculating log-odds scores of sequences and returning the motif (small section of sequence) that gives the maximum score. I have code that calculates the maximum score for each sequence in my file, and I am having trouble storing the motif that gives that score. See my other post(s) for file format, general calculation of log-odds scores, etc Perl: Creating and manipulating hash of arrays for log-odds scores of DNA sequences. My code is as follows:
use strict;
use warnings;
use List::Util 'max';
use Data::Dumper;
#USER SPECIFICATIONS
#User specifies motif width
my $width = 3;
#User enters the filename that contains the sequence data
print "Please enter the filename of the fasta sequence data: ";
my $filename1 = <STDIN>;
#Remove newline from file
chomp $filename1;
#Open the file and store each dna seq in hash
my %id2seq = ();
my %HoA = ();
my %loscore = ();
my %maxscore = ();
my %maxmot = ();
my $id = '';
open (FILE, '<', $filename1) or die "Cannot open $filename1.",$!;
my $dna;
while (<FILE>)
{
if($_ =~ /^>(.+)/)
{
$id = $1; #Stores 'Sequence 1' as the first $id, for example
}
else
{
$HoA{$id} = [ split(//) ]; #Splits the contents to allow for position reference later
$id2seq{$id} .= $_; #Creates a hash with each seq associated to an id number
$maxmot{$id} = (); #Creates empty hash to push motifs to
foreach $id (keys %HoA)
{
for my $len (0..(length($HoA{$id})-$width-1))
{
push @{ $loscore{$id} }, 0;
}
}
push @{ $maxscore{$id} }, -30; #Creates a HoA with each id number to have a maxscore (initial score -30)
}
}
close FILE;
foreach $id (keys %id2seq)
{
print "$id2seq{$id}\n\n";
}
print "\n\n";
#EXPECTATION
#Create log-odds table of motifs
my %logodds;
$logodds{'A'}[0] = 0.1;
$logodds{'A'}[1] = 0.2;
$logodds{'A'}[2] = 0.3;
$logodds{'C'}[0] = 0.2;
$logodds{'C'}[1] = 0.5;
$logodds{'C'}[2] = 0.2;
$logodds{'G'}[0] = 0.3;
$logodds{'G'}[1] = 0.2;
$logodds{'G'}[2] = 0.4;
$logodds{'T'}[0] = 0.4;
$logodds{'T'}[1] = 0.1;
$logodds{'T'}[2] = 0.1;
#MAXIMIZATION
#Determine location for each sequence that maximally
#aligns to the motif pattern
foreach $id (keys %HoA)
{
for my $pos1 (0..length($HoA{$id})-$width-1) #Look through all positions the motif can start at
{
for my $pos2 ($pos1..$pos1+($width-1)) #Define the positions within the motif (0 to width-1)
{
for my $base (qw( A C G T))
{
if ($HoA{$id}[$pos2] eq $base) #If the character matches a base:
{
for my $pos3 (0..$width-1) #Used for position reference in %logodds
{
#Calculate the log-odds score at each location
$loscore{$id}[$pos2] += $logodds{$base}[$pos3];
#Calculate the maximum log-odds score for each sequence
#Find the motif that gives the maximum score for each sequence
$maxscore{$id} = max( @{ $loscore{$id} });
if ($loscore{$id}[$pos2] == $maxscore{$id})
{
push @{ $maxmot{$id} }, $HoA{$id}[$pos3]; #NOT SURE THAT THIS IS THE CORRECT WAY TO PUSH WHAT I WANT
}
}
}
}
}
}
}
print "\n\n";
print Dumper(\%maxmot);
The expected output for the %maxmot
should be something like this:
'Sequence 11' => [ 'C', 'A', 'T'],
'Sequence 14' => ['C', 'T', 'G'], etc.
There should be three bases in the expected output because the $width = 3
. The output I get gives me multiples of each base, not in any noticeable order (or at least, I cannot notice a pattern):
'Sequence 11' => [ 'C', 'C', 'C', 'A', 'A', 'A', 'A', 'T', 'A', 'A', 'T', 'T', 'T'],
'Sequence 14' => ['C', 'C', 'T', 'T', 'C', 'C', 'T', 'T', 'T', 'T', 'T'], etc.
I believe the issue is rooted in the push @{ $maxmot{$id} }, $HoA{$id}[$pos3];
step, but I'm not quite sure how to fix it. I have tried pushing $HoA{$id}[$pos2]
instead, but that does not seem to fix my problem either. As always, any and all help is appreciated! I can clarify if needed, I know my question is a little convoluted. Thank you in advance.