4

Hello,

I try to write a program that reads in a FASTA-formatted file containing multiple DNA sequences, identifies all repeated 4-mers (i.e., all 4-mers that occur more than once) in a sequence, and prints out the repeated 4-mer and the header of the sequence in which it was found. A k-mer is simply a sequence of k nucleotides (e.g., “aaca”, “gacg”, and “tttt” are 4-mers).

Here's my code:

use strict;
use warnings;

my $count = -1;
my $file = "sequences.fa";
my $seq = '';
my @header = ();
my @sequences = ();
my $line = '';
open (READ, $file) || die "Cannot open $file: $!.\n";

while ($line = <READ>){
    chomp $line;
    if ($line =~ /^>/){
        push @header, $line;
        $count++;
        unless ($seq eq ''){
            push @sequences, $seq;
            $seq = '';
        }
    } else {
        $seq .= $line;
    }
}   push @sequences, $line;

for (my $i = 0; $i <= $#sequences+1; $i++){
    if ($sequences[$i] =~ /(....)(.)*\g{1}+/g){
        print $header[$i], "\n", $&, "\n";
    }
}

I have two requests: First, I don't know how to design my regex pattern to get the desired output. And second, less importantly, I'm sure my code is very inefficient, so if there's a way to shorten it, please tell me.

Thanks in advance!

Here's an example for a FASTA file: (Note that there's an extra line between the sequences, which is NOT the case in original fasta files)

>NC_001422.1 Enterobacteria phage phiX174 sensu lato, complete genome GAGTTTTATCGCTTCCATGACGCAGAAGTTAACACTTTttttttCGGATATTTCTGATGAGTCGAAAAAT CCCTTACTTGAGGATAtatataAATTATGTCTAATATTCAAACTGGCGCCGAGCGTATGCCGCATGACCT

>NC_001501.1 Enterobacteria phage phiX184 sensu lato, complete genome AACGGCTGGTCAGTATTTAAGGTTAGTGCTGAGGTTGACTACATCTGTTTTTAGAGACCCAGACCTTTTA TCTCACTTCTGTTACTCCAGCTTCTTCGGCACCTGTTTTACAGACACCTAAAGCTACATCGTCAACGTTA TATTTTGATAGTTTGACGGTTAATGCTGGTAATGGTgagagagaGGTTTTCTTCATTGCATTCAGATGGA TCAACGCCGCTAATCAGGTTGTTTCTGTTGGTGCTGATATTGCTTTTGATGCCGACCCTAAATTTTTTGC CTGTTTGGTTCGCTTTGAGTCTTCTTCGGTTCCGACTACCCTCCCGACTGCCTATGATGTTTATCCTTTG

>NC_001622.5 Enterobacteria phage phiX199 sensu lato, complete genome TTCGCTGAATCAGGTTATTAAAGAGTTGCCGAGATATTTATGTTGGTTTCATGCGGATTGGTCGTTTAAA TTGGACTTGGTGGCAAGTCTGCCGCTGATAAAGGAAAGGATAATGACCAAATCAAAGAACTCGTGATTAT CTTGCTGCTGCATTTCCTGAGCTTAATGCTTGGGAGCGTGCTGGTGCTGATGCTTCCTCTGCTGGTATGG TTGACGCCGGATTTGAGAATCAAAAATGTGAGAGAGCTTACTAAAATGCAACTGGACAATCAGAAAGAGA GATGCAAAATGAGACTCAAAAAGAGATTGCTGGCATTCAGTCGGCGACTTCACGCCAGAATACGAAAGAC CAGGTATATGCACAAAATGAGATGCTTGCTTATCAACAGAAGGAGTCTACTGCTCGCGTTGCGTCTATTA TGGAAAACACCAATCTTTCCAAGCAACAGCAGGTTTCCGAGATTATGCGCCAAATGCTTACTCAAGCTCA AACGGCTGGTCAGTATTTTACCAATGACCAAATCAAAGAAATGACTCGCAAGGTTAGTGCTGAGGTTGAC TTAGATGAGTGTTCATCAGCAAACGCAGAATCAGCGGTATGGCTCTTCTCATATTGGCGCTACTGCAAAG

ic23oluk
  • 125
  • 1
  • 9
  • 1
    Well, done for explaining what a 4-mer actually is! Just one question - can they overlap? And do you have some sample data and desired output? – Sobrique Jun 28 '17 at 08:24
  • Yes, they can overlap. I tried to attach a fasta file but it looks like it's not possible. I will copy an example into the question. Unfortunately I don't have a sample for the desired output – ic23oluk Jun 28 '17 at 08:32
  • What version of Perl are you using? `perl -v` – Zaid Jun 28 '17 at 08:44
  • I'm using version 5.18 – ic23oluk Jun 28 '17 at 08:47

1 Answers1

5

I'd probably tackle your problem rather more like this:

#!/usr/bin/env perl

use strict;
use warnings;

use Data::Dumper;

#set paragraph mode. Iterate on blank lines. 
local $/ = ''; 

#read from STDIN or a file specified on command line, 
#e.g. cat filename_here | myscript.pl
#or myscript.pl filename_here
while ( <> ) {
   #capture the header line, and then remove it from our data block
   my ($header) = m/\>(.*)/;
   s/>.*$//;

   #remove linefeeds and whitespace. 
   s/\s*\n\s*//g;
   #use lookahead pattern, so the data isn't 'consumed' by the regex. 
   my @sequences = m/(?=([atcg]{4}))/gi;

   #increment a count for each sequence found. 
   my %count_of;
   $count_of{$_}++ for @sequences;

   #print output. (Modify according to specific needs. 
   print $header,"\n";

   print "Found sequences:\n";
   print Dumper \@sequences;
   print "Count:\n";
   print Dumper \%count_of;

   #note - ordered, but includes duplicates. 
   #you could just use keys  %count_of, but that would be unordered. 
   foreach my $sequence ( grep { $count_of{$_} > 1 } @sequences ) {
      print $sequence, " => ", $count_of{$sequence},"\n";
   }
   print "\n";
}

We iterate record by record, capture and remove the 'header' line, and then splice together the rest. Then capture each (overlapping) sequence of 4, and count them.

This, for your sample data (first stanza for brevity):

NC_001422.1 Enterobacteria phage phiX174 sensu lato, complete genome 
Found sequences:
    GAGT => 2
    AGTT => 2
    TTAT => 2
    CATG => 2
    ATGA => 3
    TGAC => 2
    CGCA => 2
    AGTT => 2
    ACTT => 2
    tttt => 3
    tttt => 3
    tttt => 3
    GGAT => 2
    GATA => 2
    ATAT => 2
    TATT => 2
    ATGA => 3
    TGAG => 2
    GAGT => 2
    AAAA => 2
    AAAA => 2
    ACTT => 2
    TGAG => 2
    GGAT => 2
    GATA => 2
    tata => 2
    tata => 2
    TTAT => 2
    TATG => 2
    ATAT => 2
    TATT => 2
    GCCG => 2
    TATG => 2
    GCCG => 2
    CGCA => 2
    CATG => 2
    ATGA => 3
    TGAC => 2

Note - because it's based on the original sequences, it's based on ordering within the data, and you'll see TGAC there twice because ... it's in there twice.

However you could instead:

   foreach my $sequence ( sort { $count_of{$b} <=> $count_of{$a} }
                          grep { $count_of{$_} > 1 } 
                                 keys %count_of ) {
      print $sequence, " => ", $count_of{$sequence},"\n";
   }
   print "\n";

Which will discard any with less than 2 matches, and order by frequency.

Sobrique
  • 52,974
  • 7
  • 60
  • 101
  • Your code works perfectly fine and thats the output I would expect, but honestly, there is a lot in there that I have not learned so far, for examply hashes. I also don't want to use a module (i.e. dumper). But thanks anyway for your effort :) – ic23oluk Jun 28 '17 at 09:28
  • 1
    Don't avoid modules. That's a fallacy. `Data::Dumper` is core - it ships with perl. But in this case, it's a convenience anyway, mostly for printing diag output. Hashes you should learn, because it's _exactly_ the right tool for counting things. – Sobrique Jun 28 '17 at 09:56
  • 1
    Nice trick with the lookahead capture. An alternative approach with goes directly to the hash: `my %count_of; /^.*?([atgc]{4})(?{ $count_of{$1}++ })(*FAIL)/;` – Zaid Jun 28 '17 at 10:49
  • 2
    @Zaid: That's dreadful. Regex patterns are cryptic enough with their one-letter commands without embedding Perl code inside them. It is twice the length of **Sobrique's** solution, which is the way I would solve this, and no more goes *"directly to the hash"* than his answer. Please don't promote hacky solutions to simple problems in an effort to show off. Perl already has an undeserved reputation for being hard to read without you adding fuel to the fire. – Borodin Jun 28 '17 at 11:21
  • 1
    @ic23oluk: *"I also don't want to use a module (i.e. dumper)"* This attitude is so frustrating. You are depriving yourself of 90% of the usefulness of Perl for purely imagined reasons. You are already using two "modules" in your own code: `strict` and `warnings`, and `Data::Dumper` is no less trivial to use because it is also part of perl itself. – Borodin Jun 28 '17 at 11:32
  • @ic23oluk: The usual objection to non-core modules is that people don't know how to install them. That is usually just a matter of running `cpan My::Module`, and even if you are running on a shared system where you don't want to modify the public perl then you can easily install it into a local directory for your use only. There really is no reason to cripple Perl just because something about it bothers you. If you're unsure about installing modules locally then ask a question on Stack Overflow and we will help. – Borodin Jun 28 '17 at 11:34
  • I have some sympathy with places where it's hard to download and install stuff 'because policy'. I've had to fight that attitude for many years. But that doesn't apply to core stuff anyway. – Sobrique Jun 28 '17 at 11:46
  • @Borodin: please don't get me wrong. I am currently learning Perl I want want to avoid Modules right now, but I will, of course, make use of them later. – ic23oluk Jun 28 '17 at 12:41
  • @ic23oluk: Learning what modules are available and how to use them is central to learning Perl. `Data::Dumper` in particular is a very valuable debugging tool, and it would be foolish to artificially restrict yourself to a cut-down version of perl that just makes everything harder. As I said, you should remove `strict` and `warnings` as well if you want to avoid modules altogether. – Borodin Jun 28 '17 at 12:46
  • 1
    @Sobrique: Of course, if the company has a policy of disallowing third-party code in general then I agree with you. But the usual reasons for avoiding CPAN modules are not knowing how to install them, or not wanting or being able to modify a shared perl. Those issues have very straightforward solutions that often aren't appreciated. – Borodin Jun 28 '17 at 12:50
  • 2
    Yeah, but seriously - don't remove `strict;` `warnings`. Just stick with core modules if that's the limit of it. – Sobrique Jun 28 '17 at 12:55