1

I have written the following code in Perl. I want to iterate through a string 3 positions (characters) at a time. If TAA, TAG, or TGA (stop codons) appears, I want to print everything up to and including the stop codon and remove the remaining characters.


Example:

data.txt

ATGGGTAATCCCTAGAAATTT

ATGCCATTCAAGTAACCCTTT

Answer:

ATGGGTAATCCCTAG (last 6 characters removed)

ATGCCATTCAAGTAA (last 6 characters removed)

(Each sequence begins with ATG).


Code:

#!/usr/bin/perl -w

open FH, "data.txt";
@a=<FH>;

foreach $tmp(@a)
{
  for (my $i=0; $i<(length($tmp)-2); $i+=3)
  {
    if ($tmp=~/(ATG)(\w+)(TAA|TAG|TGA)\w+/)
    {
      print "$1$2$3\n";
    }
    else 
    { 
      print "$tmp\n"; 
    }
    $tmp++;
  }
}
exit;

However, my code is not giving the correct result. The characters must not overlap (I want to advance 3 characters at a time).

Can someone suggest how to fix the error?

Thanks!

zock
  • Do you want to remove everything after the last of the stop codons, or do you want to remove everything after the first of the stop codons? – ikegami Apr 01 '12 at 15:18
  • I want to remove everything after the first stop codon. – zock Apr 01 '12 at 15:23

4 Answers

-1

Script:

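#!/usr/bin/perl

use strict;
use warnings;

open my $fh, '<', 'data.txt' or die "Cannot open data.txt: $!";

while (my $line = <$fh>) {
  chomp $line;  # drop the newline so unmatched lines are not double-spaced
  # After ATG, lazily consume whole codons up to the first in-frame stop codon
  my $out = $line =~ /^(ATG(?:...)*?(?:TAA|TAG|TGA))/ ? $1 : $line;
  print "$out\n";
}
close $fh;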

Output:

ATGGGTAATCCCTAG
ATGCCATTCAAGTAA
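
For comparison, the same three-at-a-time stepping can be spelled out as an explicit loop, which is closer to what the `for` loop in the question was attempting. This is only a sketch: `trim_at_stop` is a name made up for illustration, and it assumes the line has already been chomped.

```perl
use strict;
use warnings;

# Hypothetical helper: return $seq cut just after the first in-frame stop codon.
sub trim_at_stop {
    my ($seq) = @_;
    my %stop = map { $_ => 1 } qw(TAA TAG TGA);

    # Step through the string one whole codon (3 characters) at a time.
    for (my $i = 0; $i + 3 <= length($seq); $i += 3) {
        my $codon = substr($seq, $i, 3);
        return substr($seq, 0, $i + 3) if $stop{$codon};
    }
    return $seq;    # no in-frame stop codon: leave the sequence unchanged
}

print trim_at_stop('ATGGGTAATCCCTAGAAATTT'), "\n";    # ATGGGTAATCCCTAG
print trim_at_stop('ATGCCATTCAAGTAACCCTTT'), "\n";    # ATGCCATTCAAGTAA
```

Because the index only ever advances by 3, an out-of-frame stop codon is never considered.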
Ωmega
-2

I think this code will do it. It uses \w{3} to match three-character codons, as you need.

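#!/usr/bin/perl -w
open FH, "data.txt" or die "Cannot open data.txt: $!";
@a=<FH>;
foreach $tmp(@a) {
  chomp $tmp;  # avoid printing a second newline for unmatched lines
  # *? is non-greedy, so the match ends at the first in-frame stop codon
  if ($tmp=~ /^(ATG(?:\w{3})*?(?:TAA|TAG|TGA))/) {
    print "$1\n";
  } else {
    print "$tmp\n";
  }
}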
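
Whether the codon group is quantified with `*` or `*?` makes no difference for the two sample sequences, but it matters as soon as a sequence contains more than one in-frame stop codon. A small sketch on a made-up sequence (not from the question's data):

```perl
use strict;
use warnings;

# Made-up sequence with two in-frame stop codons: ATG CCC TAG GGG TAA TTT
my $seq = 'ATGCCCTAGGGGTAATTT';

# Greedy: grabs as many codons as it can, so it ends at the LAST stop codon.
my ($greedy) = $seq =~ /^(ATG(?:\w{3})*(?:TAA|TAG|TGA))/;

# Non-greedy: grabs as few codons as it can, so it ends at the FIRST stop codon.
my ($lazy) = $seq =~ /^(ATG(?:\w{3})*?(?:TAA|TAG|TGA))/;

print "greedy: $greedy\n";    # greedy: ATGCCCTAGGGGTAA
print "lazy:   $lazy\n";      # lazy:   ATGCCCTAG
```

Since the asker wants everything after the first stop codon removed, the non-greedy form is the one that fits.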
mcsi
  • ...iterate through a string 3 positions (**characters**) at a time... `\w{3}` is not for any 3 characters – Ωmega Apr 01 '12 at 16:04
-2

You say you want to remove everything after the first stop codon. If so, all you need is

while (<FH>) {
   s/(?<=TAA|TAG|TGA).*//;
   print;
}

But then there's the mystical "I want to iterate through a string 3 positions (characters) at a time" requirement. That doesn't make any sense. Perhaps you want the match to occur at a position that's divisible by three? If so, you'd use

s/^(?:.{3})*?(?:TAA|TAG|TGA)\K.*//;    # Requires 5.10+
s/^((?:.{3})*?(?:TAA|TAG|TGA)).*/$1/;  # Backwards compatible
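
To make the difference between the two readings concrete, here is a rough sketch on a made-up sequence (not from the question's data) that contains an out-of-frame TGA right at the start, inside `ATGA`:

```perl
use strict;
use warnings;

# Made-up sequence; in frame it reads ATG ATA GCC TAA GGG
my $seq = 'ATGATAGCCTAAGGG';

# Any position: cut after the first stop codon found anywhere in the string.
(my $any_position = $seq) =~ s/(?<=TAA|TAG|TGA).*//;

# In frame only: the stop codon must start at an offset divisible by three.
(my $in_frame = $seq) =~ s/^((?:.{3})*?(?:TAA|TAG|TGA)).*/$1/;

print "$any_position\n";    # ATGA
print "$in_frame\n";        # ATGATAGCCTAA
```

The first substitution stops at the `TGA` that straddles the first two codons, while the second skips it and stops at the in-frame `TAA`.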
ikegami
  • ...iterate through a string **3** positions (characters) at a time... that `3` is what is important – Ωmega Apr 01 '12 at 16:03
  • @stackoverflow, The first pattern does look only three characters at a time. The second doesn't, because it's possible to match more than three characters by only looking at 3. – ikegami Apr 01 '12 at 17:04
  • You updated your response, so now you see your mistake in original response :) Also - there is no need to use `?:` grouping, as you use just `$1` anyway... – Ωmega Apr 01 '12 at 17:13
  • @stackoverflow, Backwards. I do need to use grouping (`(?:)`); I don't require capturing (`()`) as well. – ikegami Apr 01 '12 at 19:52
  • @stackoverflow, It's not a mistake to answer what was asked even if it's not the right answer. Acme::ESP doesn't always work. – ikegami Apr 01 '12 at 19:54
  • `?:` is non-capturing **grouping** – Ωmega Apr 02 '12 at 01:00
  • @stackoverflow, I know, that's what I just said. Why? – ikegami Apr 02 '12 at 07:03
-2

May I suggest a reading of perlretut (about 4 paragraphs down from here)? It actually covers almost exactly this situation with avoiding overlaps and finding stop codons.
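
If I remember the tutorial correctly, the relevant part is its discussion of the `\G` anchor combined with the `/g` modifier, which makes each match start exactly where the previous one ended. A minimal sketch of that idea applied to this question follows; `trim_with_g` is a hypothetical helper name, and the input is assumed to be already chomped.

```perl
use strict;
use warnings;

# Hypothetical helper: consume the sequence codon by codon using \G and /g.
sub trim_with_g {
    my ($seq) = @_;
    my $kept = '';

    # \G pins each match to the end of the previous one, so codons never overlap.
    while ($seq =~ /\G(...)/g) {
        $kept .= $1;
        return $kept if $1 =~ /^(?:TAA|TAG|TGA)$/;    # stop codon reached
    }
    return $seq;    # no in-frame stop codon: keep the whole sequence
}

print trim_with_g('ATGGGTAATCCCTAGAAATTT'), "\n";    # ATGGGTAATCCCTAG
```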

MkV