1

I am calculating log-odds scores of sequences and returning the motif (small section of sequence) that gives the maximum score. I have code that calculates the maximum score for each sequence in my file, and I am having trouble storing the motif that gives that score. See my other post(s) for file format, general calculation of log-odds scores, etc Perl: Creating and manipulating hash of arrays for log-odds scores of DNA sequences. My code is as follows:

use strict;
use warnings;
use List::Util 'max';
use Data::Dumper; 

#USER SPECIFICATIONS
#User specifies motif width
my $width = 3;

#User enters the filename that contains the sequence data
print "Please enter the filename of the fasta sequence data: ";
my $filename1 = <STDIN>;

#Remove newline from file
chomp $filename1;

#Open the file and store each dna seq in hash
my %id2seq = ();
my %HoA = ();
my %loscore = ();
my %maxscore = ();
my %maxmot = ();
my $id = '';
open (FILE, '<', $filename1) or die "Cannot open $filename1.",$!;
my $dna;
while (<FILE>)
{
    if($_ =~ /^>(.+)/)
    {
        $id = $1; #Stores 'Sequence 1' as the first $id, for example
    }
    else
    {
        $HoA{$id} = [ split(//) ]; #Splits the contents to allow for position reference later
        $id2seq{$id} .= $_; #Creates a hash with each seq associated to an id number
        $maxmot{$id} = (); #Creates empty hash to push motifs to
        foreach $id (keys %HoA)
        {
            for my $len (0..(length($HoA{$id})-$width-1))
            {
                push @{ $loscore{$id} }, 0;
            }
        }
        push @{ $maxscore{$id} }, -30; #Creates a HoA with each id number to have a maxscore (initial score -30)
    }
}
close FILE;

foreach $id (keys %id2seq)
{
    print "$id2seq{$id}\n\n";
}
print "\n\n";



#EXPECTATION
#Create log-odds table of motifs
my %logodds;
$logodds{'A'}[0] = 0.1;
$logodds{'A'}[1] = 0.2;
$logodds{'A'}[2] = 0.3;
$logodds{'C'}[0] = 0.2;
$logodds{'C'}[1] = 0.5;
$logodds{'C'}[2] = 0.2;
$logodds{'G'}[0] = 0.3;
$logodds{'G'}[1] = 0.2;
$logodds{'G'}[2] = 0.4;
$logodds{'T'}[0] = 0.4;
$logodds{'T'}[1] = 0.1;
$logodds{'T'}[2] = 0.1;

#MAXIMIZATION
#Determine location for each sequence that maximally
#aligns to the motif pattern

foreach $id (keys %HoA)
{   
    for my $pos1 (0..length($HoA{$id})-$width-1)    #Look through all positions the motif can start at
    {
        for my $pos2 ($pos1..$pos1+($width-1))  #Define the positions within the motif (0 to width-1)
        {           
            for my $base (qw( A C G T))
            {
                if ($HoA{$id}[$pos2] eq $base)  #If the character matches a base:
                {
                    for my $pos3 (0..$width-1)  #Used for position reference in %logodds
                    {
                        #Calculate the log-odds score at each location
                        $loscore{$id}[$pos2] += $logodds{$base}[$pos3];

                        #Calculate the maximum log-odds score for each sequence

                        #Find the motif that gives the maximum score for each sequence
                        $maxscore{$id} = max( @{ $loscore{$id} });
                        if ($loscore{$id}[$pos2] == $maxscore{$id})
                        {
                            push @{ $maxmot{$id} }, $HoA{$id}[$pos3]; #NOT SURE THAT THIS IS THE CORRECT WAY TO PUSH WHAT I WANT
                        }
                    }
                }
            }
        }
    }
}

print "\n\n";
print Dumper(\%maxmot);

The expected output for the %maxmot should be something like this:

'Sequence 11' => [ 'C', 'A', 'T'], 'Sequence 14' => ['C', 'T', 'G'], etc.

There should be three bases in the expected output because the $width = 3. The output I get gives me multiples of each base, not in any noticeable order (or at least, I cannot notice a pattern):

'Sequence 11' => [ 'C', 'C', 'C', 'A', 'A', 'A', 'A', 'T', 'A', 'A', 'T', 'T', 'T'], 'Sequence 14' => ['C', 'C', 'T', 'T', 'C', 'C', 'T', 'T', 'T', 'T', 'T'], etc. I believe the issue is rooted in the push @{ $maxmot{$id} }, $HoA{$id}[$pos3]; step, but I'm not quite sure how to fix it. I have tried pushing $HoA{$id}[$pos2] instead, but that does not seem to fix my problem either. As always, any and all help is appreciated! I can clarify if needed, I know my question is a little convoluted. Thank you in advance.

William
  • 39
  • 7

2 Answers2

3

The push() is not the problem. From running your code it becomes obvious that the conditional $loscore{$id}[$pos2] == $maxscore{$id} is true more often than you expect it.

Here are some questions I would ask in a code review:

  • why do you use length() on an array? It does not return the length of the array.
  • Isn't for my $base (qw( A C G T)) { if ($HoA{$id}[$pos2] eq $base) {... just an inefficient way of the equivalent my $base = $HoA{$id}[$pos2];?
  • the calculation for each $pos2 is executed $pos2 + 1 times, i.e. once for 0, twice for 1, ... i.e. later positions get a higher score.
  • one calculation for $loscore{$id}[$pos2] is the sum of @{ $logodds{$base} }, i.e. the base at position $pos2 + $pos3 is ignored for the calculation
  • you are re-calculating $maxscore{$id} while running over the sequences and then use the changing value in the conditional
  • (my guess) a motif is supposed to be $width bases long, but your code only stores single bases into %maxmot

I'm making an educated guess and propose that the following is the correct algorithm. I'm using the 3 sequences you have given in your previous question. I'm also dumping the other 2 hashes, so that the calculation results become visible.

I took the liberty to rewrite your code to be more concise and clear. But you should be able to trace back the lines in the new code to your original code.

#!/usr/bin/perl
use strict;
use warnings;

use List::Util 'max';
use Data::Dumper;

my $width = 3;

my %HoA;
my %maxpos;
my %loscore;
my $id = '';
while (<DATA>) {
    if (/^>(.+)/) {
        $id = $1;
    } else {
        $HoA{$id}     = [ split(//) ];
        $maxpos{$id}  = @{ $HoA{$id} } - $width - 1;
        $loscore{$id} = [ (0) x ($maxpos{$id} + 1) ];
    }
}

my %logodds = (
    A => [0.1, 0.2, 0.3],
    C => [0.2, 0.5, 0.2],
    G => [0.3, 0.2, 0.4],
    T => [0.4, 0.1, 0.1],
);

#MAXIMIZATION
my %maxscore;
my %maxmot;

# Calculate the log-odds score at each location
foreach $id (keys %HoA) {
    for my $index (0..$maxpos{$id}) {
        for my $offset (0..$width-1) {
            # look at base in sequence $id at $offset after $index
            my $base = $HoA{$id}[$index + $offset];
            $loscore{$id}[$index] += $logodds{$base}[$offset];
        }
    }
}

# Calculate the maximum log-odds score for each sequence
foreach $id (keys %HoA) {
    $maxscore{$id} = max( @{ $loscore{$id} });
}

# Find the motif that gives the maximum score for each sequence
foreach $id (keys %HoA) {
    for my $index (0..$maxpos{$id}) {
        if ($loscore{$id}[$index] == $maxscore{$id}) {
            # motif of length $width
            my $motif = join('', @{ $HoA{$id} }[$index..$index + $width - 1]);
            $maxmot{$id}->{$motif}++;
        }
    }
}

print Data::Dumper->Dump([\%loscore, \%maxscore, \%maxmot],
                         [qw(*loscore *maxscore *maxmot)]);

exit 0;

__DATA__
>Sequence_1
TCAGAACCAGTTATAAATTTATCATTTCCTTCTCCACTCCT
>Sequence_2
CCCACGCAGCCGCCCTCCTCCCCGGTCACTGACTGGTCCTG
>Sequence_3
TCGACCCTCTGGAACCTATCAGGGACCACAGTCAGCCAGGCAAG

Test run:

$ perl dummy.pl
%loscore = (
             'Sequence_1' => [
                               '1.2',
                               '0.8',
                               '0.6',
                               '0.8',
                               '0.5',
                               '0.8',
                               '1',
                               '0.8',
                               '0.4',
                               '0.5',
                               '0.8',
                               '0.7',
                               '0.5',
                               '0.9',
                               '0.6',
                               '0.4',
                               '0.3',
                               '0.6',
                               '0.8',
                               '0.7',
                               '0.4',
                               '1.2',
                               '0.5',
                               '0.3',
                               '0.6',
                               '0.7',
                               '1.1',
                               '0.8',
                               '0.4',
                               '0.7',
                               '1',
                               '0.5',
                               '1.1',
                               '1',
                               '0.6',
                               '0.7',
                               '0.5',
                               '1.1',
                               '0.8'
                             ],
             'Sequence_2' => [
                               '0.9',
                               '1',
                               '0.6',
                               '1',
                               '0.6',
                               '1.1',
                               '0.8',
                               '0.5',
                               '1',
                               '1.1',
                               '0.6',
                               '1',
                               '0.9',
                               '0.8',
                               '0.5',
                               '1.1',
                               '0.8',
                               '0.5',
                               '1.1',
                               '0.9',
                               '0.9',
                               '1.1',
                               '0.8',
                               '0.6',
                               '0.6',
                               '1.2',
                               '0.6',
                               '0.7',
                               '0.7',
                               '0.9',
                               '0.7',
                               '0.7',
                               '0.7',
                               '1',
                               '0.6',
                               '0.6',
                               '1.1',
                               '0.8',
                               '0.7'
                             ],
             'Sequence_3' => [
                               '1.3',
                               '0.7',
                               '0.7',
                               '0.8',
                               '0.9',
                               '0.8',
                               '0.5',
                               '1',
                               '0.7',
                               '1',
                               '0.8',
                               '0.8',
                               '0.5',
                               '0.8',
                               '0.8',
                               '0.6',
                               '0.7',
                               '0.4',
                               '1.2',
                               '0.8',
                               '0.7',
                               '0.9',
                               '0.8',
                               '0.7',
                               '0.8',
                               '1',
                               '0.6',
                               '0.9',
                               '0.8',
                               '0.4',
                               '0.6',
                               '1.2',
                               '0.8',
                               '0.5',
                               '1',
                               '1',
                               '0.8',
                               '0.7',
                               '0.7',
                               '1.1',
                               '0.7',
                               '0.7'
                             ]
           );
%maxscore = (
              'Sequence_1' => '1.2',
              'Sequence_3' => '1.3',
              'Sequence_2' => '1.2'
            );
%maxmot = (
            'Sequence_3' => {
                              'TCG' => 1
                            },
            'Sequence_2' => {
                              'TCA' => 1
                            },
            'Sequence_1' => {
                              'TCA' => 2
                            }
          );

This looks much closer to your expected output. But of course I could be completely off with my guesses...


If I understand the logscore calculation correctly, then the score per motif is a constant and hence can be pre-calculated. Which would lead to the following more straightforward approach:

#!/usr/bin/perl
use strict;
use warnings;

use Data::Dumper;

my $width = 3;

my %logodds = (
    A => [0.1, 0.2, 0.3],
    C => [0.2, 0.5, 0.2],
    G => [0.3, 0.2, 0.4],
    T => [0.4, 0.1, 0.1],
);

# calculate log score for each motif combination
my $motif_score = {'' => 0}; # start with a 0-length motif
foreach my $offset (0..$width - 1) {
    my %scores;

    # for all motifs...
    foreach my $prefix (keys %{ $motif_score }) {
        my $base_score = $motif_score->{$prefix};

        # ... add another base to the motif
        for my $base (qw(A G C T)) {
            $scores{"${prefix}${base}"} = $base_score + $logodds{$base}[$offset];
        }
    }

    # store the scores for the new sequences
    $motif_score = \%scores;
}

#print Data::Dumper->Dump([$motif_score], [qw(motif_score)]);

my $id;
my %maxmot;
while (<DATA>) {
    if (/^>(.+)/) {
        $id = $1;
    } else {
        chomp(my $sequence = $_);
        my $max = -1;

        # run a window of length $width over the sequence
        foreach my $index (0..length($sequence) - $width - 1) {

            # extract the motif at $index from sequence
            my $motif = substr($sequence, $index, $width);
            my $score = $motif_score->{$motif};

            # update max score if the motif has a higher score
            if ($score > $max) {
                $max         = $score;
                $maxmot{$id} = $motif;
            }
        }
    }
}

print Data::Dumper->Dump([\%maxmot], [qw(*maxmot)]);

exit 0;

__DATA__
>Sequence_1
TCAGAACCAGTTATAAATTTATCATTTCCTTCTCCACTCCT
>Sequence_2
CCCACGCAGCCGCCCTCCTCCCCGGTCACTGACTGGTCCTG
>Sequence_3
TCGACCCTCTGGAACCTATCAGGGACCACAGTCAGCCAGGCAAG

Test run:

$ perl dummy.pl
%maxmot = (
            'Sequence_2' => 'TCA',
            'Sequence_3' => 'TCG',
            'Sequence_1' => 'TCA'
          );

As the logscore per motif is a constant, the motif list sorted by logscore order will also be a constant. Given that list we will only have to find the first motif that matches on a given sequence. Hence we can apply the highly optimized regular expression engine on the problem. Depending on your actual problem size this will probably be the more efficient solution:

#!/usr/bin/perl
use warnings;
use strict;

use List::Util qw(first pairs);
use Data::Dumper;

my $width = 3;

my %logodds = (
    A => [0.1, 0.2, 0.3],
    C => [0.2, 0.5, 0.2],
    G => [0.3, 0.2, 0.4],
    T => [0.4, 0.1, 0.1],
);
my @bases = keys %logodds;

# calculate log score for each motif combination
my $motif_logscore = {'' => 0}; # start with a 0-length motif
foreach my $offset (0..$width - 1) {
    my %score;

    # for all motifs...
    foreach my $prefix (keys %{ $motif_logscore }) {
        my $base_score = $motif_logscore->{$prefix};

        # ... add another base to the motif
        for my $base (@bases) {
            $score{"${prefix}${base}"} = $base_score + $logodds{$base}[$offset];
        }
    }

    # update hash ref to new motif scores
    $motif_logscore = \%score;
}

#print Data::Dumper->Dump([$motif_logscore], [qw(motif_logscore)]);

my @motifs_sorted =
    # list of [<motif>, <regular expression>] array refs
    map    { [$_->[0], qr/$_->[0]/] }
    # sort in descending numerical score order
    sort   { $b->[1] cmp $a->[1] }
    # list of [<motif>, <score>] array refs
    pairs %{ $motif_logscore };

#print Data::Dumper->Dump([\@motifs_sorted], [qw(*motifs_sorted)]);

my $id;
my %maxmot;
while (<DATA>) {
    if (/^>(.+)/) {
        $id = $1;
    } else {
        my $sequence = $_;
        # find the first pair where the regex matches -> store motif
        $maxmot{$id} = (
            first { ($sequence =~ $_->[1])[0] } @motifs_sorted
        )->[0];
    }
}

print Data::Dumper->Dump([\%maxmot], [qw(*maxmot)]);

exit 0;

__DATA__
>Sequence_1
TCAGAACCAGTTATAAATTTATCATTTCCTTCTCCACTCCT
>Sequence_2
CCCACGCAGCCGCCCTCCTCCCCGGTCACTGACTGGTCCTG
>Sequence_3
TCGACCCTCTGGAACCTATCAGGGACCACAGTCAGCCAGGCAAG
Stefan Becker
  • 5,695
  • 9
  • 20
  • 30
  • The only thing I can't really explain is why `$width` is configurable. The array refs stored in `%logodds` would need to be of length `$width` too. But currently they are of fixed length 3. Maybe your algorithm is only supposed to work with widths of 1, 2 and 3? – Stefan Becker Mar 31 '19 at 09:27
  • Judging by the explanation of the logscore calculation in your previous [question](https://stackoverflow.com/questions/55326503/eliminating-unitialized-values-in-my-perl-hash-of-arrays) I think my guesses are correct and the above answer shows the correct algorithm. For your example `CAG` the log score would be `0.2 + 0.2 + 0.4`. Sequence 1 has `CAG` f.ex. at index `1`, hence `$loscore{'Sequence_1'}[1]` is `0.8`. – Stefan Becker Mar 31 '19 at 10:00
  • Added an alternative implementation based on the realization that the score per motif is a constant and can be per-calculated. – Stefan Becker Mar 31 '19 at 10:47
  • The reason why `$width` is not configurable in this code is that I wanted to try to keep the post somewhat short. I was told to do that last time I posted. You are correct with the calculation of log-odds scores – William Mar 31 '19 at 14:15
  • I had a hunch that `%logodds` looked odd. So are the values in it dependent on `$width` or would you need to calculate `%logodds` for each sequence? Do any of my proposed solutions solve your issue? – Stefan Becker Mar 31 '19 at 14:52
  • I'm working through your solutions now. I appreciate the help! The amount of values in `%logodds` depends on `$width`. What I omitted (to save space in the post) is how I calculate these values. – William Mar 31 '19 at 15:22
  • I randomly seed a start position for the motif to separate motif from background in each sequence. Then, I calculate background counts and frequencies of the bases and position-specific counts and frequencies of the bases within each motif. I add `1` to each motif count to avoid a `0` value. I calculate the log-odds score, specific to each position, by taking the `log base2` of `motif frequency / background frequency`. I just set the table to specific values for simplicity in the post – William Mar 31 '19 at 15:24
  • So `%logodds` is specific per sequence or specific to the set of sequences? As the code is now the same `%logodds` will be used for all sequences in the input. – Stefan Becker Mar 31 '19 at 15:29
  • `%logodds` is specific to the set of sequences. The overall goal is to iterate through the code `~500` times to get `%maxscore` for each sequence to converge. After this, I will iterate the random seeding `~50` times to eliminate case-specific convergences. So, the `%logodds` table changes each iteration (which is why I want to access the values in it after the first calculation of `%maxscore`). BUT, honestly, my goal for the post is just to complete `1` iteration. – William Mar 31 '19 at 15:36
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/190983/discussion-between-stefan-becker-and-william). – Stefan Becker Mar 31 '19 at 16:00
0

Maybe you don't need an array but a hash?

Change the push to

undef $maxmot{$id}{ $HoA{$id}[$pos3] };

It creates a hash (with undefined values, so only the keys are important). In the output, I don't see more than 3 keys for each sequence in the input from your linked question.

choroba
  • 231,213
  • 25
  • 204
  • 289
  • I have a file that contains 29 sequences that I am using. I didn't included all of them to try to keep the question short. – William Mar 30 '19 at 23:44
  • I wanted a hash of arrays because I want to access the base at each location later in the code - I am going to create counts of each base at location, manipulate the counts, etc. I thought an array would allow that. I am open to ideas of a better way to do this though. – William Mar 30 '19 at 23:46
  • For counting you can use `++$maxmot{$id}{ $HoA{$id}[$pos3] }`. I don't understand what you mean by "the order that will return the best log-odds score". – choroba Mar 30 '19 at 23:50
  • So here is an example: say `Sequence 1` is `ACCTGA`, and we are using a motif width of `3`. Naturally, we know that the options for a motif are `ACC, CCT, CTG, TGA`. So, based on the score that each corresponds to, one of these motifs will return the maximum score. I think I give a better explanation of calculating a log-odds score here: https://stackoverflow.com/questions/55326503/eliminating-unitialized-values-in-my-perl-hash-of-arrays – William Mar 30 '19 at 23:58