Counting nucleotide frequency using perl script

Question

I have this perl script below to calculate sequence length and their frequency along with nucleotide frequency(A,T,G and C). This script works fine for a file with large number of sequences, but it does not give the right result for a file of small size like this:

infile.fasta

>NC_013116_1051_1114
TTGTCCCTTTGAGTCTCTGG
>NC_013116_1051_1114
GCGCAGCCGATATGGATAA
>NC_013116_1051_1114
TCGAGACTTTGTAATGTTTGGG
>NC_013116_1051_1114
TATTCCACGTCAGGTGCTTTT
>NC_013116_1051_1114
TAGAGCCGATTCCAGACTGTTCC
>NC_013116_1051_1114
TACAGGACCAAGCTCTTCACTC
>NC_013116_1051_1114
CCGTCAAGTTCAGCTCCAATAA
>NC_013116_6_301
CCACGCAACGGACAATCAAACA
>NC_013116_6_301
GGACACTTCCAACTATAAATA
>NC_013116_6_301
CCACGCAACGGACAATCAAACA
>NC_013116_1051_1114
GCTCTTCACTCTTCCTCGTCT
>NC_013116_1051_1114
TTGGGAAAAAGAAGTTGCTGCAGC
>NC_013116_1051_1114
TCGCAGTATCTCTGAAGTTG

count.pl

#!/usr/bin/perl -w

#usage ./count.pl infile min_length max_length
#usage ./count.pl infile 18 34

my $min_len = $ARGV[1];
my $max_len = $ARGV[2];
my $read_len = 0;
my @lines = ("header1","sequence","header2","quality");
my @lray = ();
my $count = 0;
my $total = 0;
my $i = 0;

my @Aray = ();
my @Cray = ();
my @Gray = ();
my @Tray = ();

my$FN = "";

for ($i=$min_len; $i<=$max_len; $i++){
   $lray[$i] = 0;
}

open (INFILE, "<$ARGV[0]") || die "couldn't open input file!";
   while (<INFILE>) {
      $lines[$count] = $_;
      chomp($lines[$count]);
      $count++;
      if($count eq 4){
         $read_len = length($lines[1]); 
#         print "$read_len $lines[1]\n";
         $FN = substr $lines[1], 0, 1;  
         $lray[$read_len]++;
         if ($FN eq "T") { $Tray[$read_len]++;} 
         else {        
            if ($FN eq "A"){ $Aray[$read_len]++;}
            else {
               if ($FN eq "C"){ $Cray[$read_len]++;}
               else {
                 if ($FN eq "G"){ $Gray[$read_len]++;}
               }   
            }
         }           
         $count = 0;
      }
   }
print "length\tnumber\tA\tC\tG\tT\n";
for ($i=$min_len; $i<=$max_len; $i++){
   print "$i\t$lray[$i]\t$Aray[$i]\t$Cray[$i]\t$Gray[$i]\t$Tray[$i]\n";
}
exit;

This is the type of result I get from a big file with many sequences.

length  number  A   C   G   T
18  4473    542 710 471 2750
19  12647   990 1680    1103    8874
20  31194   3010    3354    2743    22087
21  61214   6288    7196    5784    41946
22  128642  14596   11902   12518   89626
23  65190   6859    6525    7773    44033
24  10012   611 1401    1112    6888
25  1406    231 192 435 548
26  661 169 91  105 296
27  407 126 81  65  135
28  602 391 49  68  94
29  520 54  30  370 66
30  175 26  93  18  38
31  156 35  28  29  64
32  106 22  16  24  44
33  97  45  17  16  19
34  0

I would really appreciate if you could help me correct this code. Thanks

You should start every Perl file with `use strict; use warnings;`. And don't use `-w`; it's been obsolete since 2000. — melpomene, Dec 15 '17 at 21:05
@melpomene Thanks! I followed your suggestion and still not getting the right results. — MAPK, Dec 15 '17 at 21:13
@melpomene Use of uninitialized value in concatenation (.) or string at count.pl is the error message. — MAPK, Dec 15 '17 at 21:16

clt60 · Answer 1 · 2017-12-16T02:53:39.730

Trying do not reinvent the wheel, so, using the FAST module, got:

use 5.014;
use warnings;
use FAST::Bio::SeqIO;

my $fasta  = FAST::Bio::SeqIO->new(-file => "infile.fasta", -format => 'Fasta');
my $seqnum=0;
while ( my $seq = $fasta->next_seq() ) {
    my $stats;
    $stats->{len} = length($seq->seq);
    $stats->{$_}++ for split //, $seq->seq;
    say ++$seqnum, " @$stats{qw(len A C G T)}";
}

The above, for your demo infile.fasta prints:

1 20 1 5 5 9
2 19 6 4 6 3
3 22 4 2 7 9
4 21 3 5 4 9
5 23 5 7 5 6
6 22 6 8 3 5
7 22 7 7 3 5
8 22 10 8 3 1
9 21 9 5 2 5
10 22 10 8 3 1
11 21 1 9 2 9
12 24 8 3 8 5
13 20 4 4 5 7

or the

use 5.014;
use warnings;
use FAST::Bio::SeqIO;

my $fasta  = FAST::Bio::SeqIO->new(-file => "file.fasta", -format => 'Fasta');
my $stats;
while ( my $seq = $fasta->next_seq() ) {
    my $len = length($seq->seq);
    $stats->{$len}{count}++;
    $stats->{$len}{$_}++ for split //, $seq->seq;
}
say "Length $_ ($stats->{$_}->{count} times) Letters freq: @{$stats->{$_}}{qw(A C G T)}" for sort { $a <=> $b }  keys %$stats;

produce:

Length 19 (1 times) Letters freq: 6 4 6 3
Length 20 (2 times) Letters freq: 5 9 10 16
Length 21 (3 times) Letters freq: 13 19 8 23
Length 22 (5 times) Letters freq: 37 33 19 21
Length 23 (1 times) Letters freq: 5 7 5 6
Length 24 (1 times) Letters freq: 8 3 8 5

and so on...

Thank you so much. How do you aggregate the number of sequences with the same length and the total frequency of each of 4 nucleotides for each sequence length type as shown in the output I have above? — MAPK, Dec 15 '17 at 23:29
@MAPK I haven't idea about **your** output from some mysterious _big file with many sequences_ . I can only work with data you provided - and I provided some demo code - for using modules and counting letters. I know nothing about _nucleotides_ and so on... — clt60, Dec 16 '17 at 02:26

Counting nucleotide frequency using perl script

1 Answers1