Using Perl to iterate through a string 3 positions at a time

Question

I have written the following Perl code. I want to iterate through a string 3 positions (characters) at a time. If TAA, TAG, or TGA (stop codons) appears, I want to print the sequence up to and including the stop codon and discard the remaining characters.

Example:

data.txt

ATGGGTAATCCCTAGAAATTT

ATGCCATTCAAGTAACCCTTT

Answer:

ATGGGTAATCCCTAG (last 6 characters removed)

ATGCCATTCAAGTAA (last 6 characters removed)

(Each sequence begins with ATG).

Code:

#!/usr/bin/perl -w

open FH, "data.txt";
@a=<FH>;

foreach $tmp(@a)
{
  for (my $i=0; $i<(length($tmp)-2); $i+=3)
  {
    if ($tmp=~/(ATG)(\w+)(TAA|TAG|TGA)\w+/)
    {
      print "$1$2$3\n";
    }
    else 
    { 
      print "$tmp\n"; 
    }
    $tmp++;
  }
}
exit;

However, my code is not giving the correct result. The codons must not overlap: I want to advance 3 characters at a time.

Can someone suggest how to fix the error?

Thanks!

Do you want to remove everything after the last of the stop codons, or do you want to remove everything after the first of the stop codons? — ikegami, Apr 01 '12 at 15:18

Ωmega · Accepted Answer · 2012-04-01T18:45:05.563

-1

Script:

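#!/usr/bin/perl

use strict;
use warnings;

open my $fh, '<', 'data.txt' or die "Cannot open data.txt: $!";

while (my $line = <$fh>) {
  chomp $line;  # strip the newline so unmatched lines are not followed by a blank line
  # ATG, then whole codons (lazily), then the first in-frame stop codon
  my $orf = $line =~ /^(ATG(...)*?(TAA|TAG|TGA))/ ? $1 : $line;
  print "$orf\n";
}

close $fh;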

Output:

ATGGGTAATCCCTAG
ATGCCATTCAAGTAA
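
The lazy codon group in the regex above moves through the sequence one codon at a time. The same walk can be spelled out as an explicit loop. This is a minimal sketch, not part of the original answer: `trim_at_stop` is a hypothetical helper name, and it assumes each sequence starts in frame (the question says every sequence begins with ATG), so it does not check for the leading ATG.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return the sequence up to and including the first
# in-frame stop codon, or the whole sequence if no stop codon is found.
sub trim_at_stop {
    my ($seq) = @_;
    my %is_stop = map { $_ => 1 } qw(TAA TAG TGA);

    # Step through the string one codon (3 characters) at a time.
    for (my $i = 0; $i + 3 <= length $seq; $i += 3) {
        my $codon = substr $seq, $i, 3;
        return substr($seq, 0, $i + 3) if $is_stop{$codon};
    }
    return $seq;
}

for my $seq (qw(ATGGGTAATCCCTAGAAATTT ATGCCATTCAAGTAACCCTTT)) {
    my $trimmed = trim_at_stop($seq);
    print "$trimmed\n";
}
```

With the two sequences from the question this prints the same two lines as the output above.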

edited Apr 01 '12 at 18:45

answered Apr 01 '12 at 15:49

score -2 · Answer 2 · answered Apr 01 '12 at 15:53

I think this code will do it. It uses `\w{3}` to match three-character codons, as you need.

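#!/usr/bin/perl -w
open FH, "<", "data.txt" or die "Cannot open data.txt: $!";
@a=<FH>;
chomp @a;  # strip newlines so the else branch does not print a blank line
foreach $tmp(@a) {
  # (?:\w{3})* is greedy, so this keeps everything up to the last in-frame stop codon
  if ($tmp=~ /^(ATG(?:\w{3})*(?:TAA|TAG|TGA)).*/) {
    print "$1\n";
  } else {
    print "$tmp\n";
  }
}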
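
Whether the codon group is greedy or lazy matters once a sequence contains more than one in-frame stop codon. The sketch below uses a made-up sequence (not from the question's data) to show the difference. The only change between the two patterns is the `?` after `*`.

```perl
use strict;
use warnings;

# Made-up sequence with two in-frame stop codons: ATG AAA TAG CCC TAA GGG
my $seq = 'ATGAAATAGCCCTAAGGG';

# Greedy: takes as many codons as possible, so it ends at the last in-frame stop.
my ($greedy) = $seq =~ /^(ATG(?:\w{3})*(?:TAA|TAG|TGA))/;

# Lazy: takes as few codons as possible, so it ends at the first in-frame stop.
my ($lazy) = $seq =~ /^(ATG(?:\w{3})*?(?:TAA|TAG|TGA))/;

print "greedy: $greedy\n";  # greedy: ATGAAATAGCCCTAA
print "lazy:   $lazy\n";    # lazy:   ATGAAATAG
```

On the two sequences in the question both forms give the same result, because each has only one in-frame stop codon.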

answered Apr 01 '12 at 15:53

mcsi

...iterate through a string 3 positions (**characters**) at a time... `\w{3}` is not for any 3 characters – Ωmega Apr 01 '12 at 16:04

ikegami · Answer 3 · 2012-04-01T17:07:43.583

-2

You say you want to remove everything after the first stop codon. If so, all you need is

while (<FH>) {
   s/(?<=TAA|TAG|TGA).*//;
   print;
}

But then there's the mystical "I want to iterate through a string 3 positions (characters) at a time" requirement. That doesn't make any sense. Perhaps you want the match to occur at a position that's divisible by three? If so, you'd use

s/^(?:.{3})*?(?:TAA|TAG|TGA)\K.*//;    # Requires 5.10+
s/^((?:.{3})*?(?:TAA|TAG|TGA)).*/$1/;  # Backwards compatible
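
The two substitutions differ only when a stop codon appears out of frame before the first in-frame one. A small sketch with a made-up sequence (an assumption for illustration, not from the question's data) shows this:

```perl
use strict;
use warnings;

# Made-up sequence: the codons are ATG CTA AGG TAG CCC, and "TAA" straddles
# the second and third codons (offset 4), so it is out of frame.
my $seq = 'ATGCTAAGGTAGCCC';

# Unanchored: cuts after the first stop codon found at any offset.
(my $any_frame = $seq) =~ s/(?<=TAA|TAG|TGA).*//;

# Anchored in steps of three: cuts after the first stop codon at an offset
# that is a multiple of three.
(my $in_frame = $seq) =~ s/^((?:.{3})*?(?:TAA|TAG|TGA)).*/$1/;

print "any frame: $any_frame\n";  # any frame: ATGCTAA
print "in frame:  $in_frame\n";   # in frame:  ATGCTAAGGTAG
```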

edited Apr 01 '12 at 17:07

answered Apr 01 '12 at 15:56

...iterate through a string **3** positions (characters) at a time... that `3` is what is important – Ωmega Apr 01 '12 at 16:03
@stackoverflow, The first pattern does look only three characters at a time. The second doesn't, because it's possible to match more than three characters by only looking at 3. – ikegami Apr 01 '12 at 17:04
You updated your response, so now you see your mistake in original response :) Also - there is no need to use `?:` grouping, as you use just `$1` anyway... – Ωmega Apr 01 '12 at 17:13
@stackoverflow, Backwards. I do need to use grouping (`(?:)`); I don't require capturing (`()`) as well. – ikegami Apr 01 '12 at 19:52
@stackoverflow, It's not a mistake to answer what was asked even if it's not the right answer. Acme::ESP doesn't always work. – ikegami Apr 01 '12 at 19:54
`?:` is non-capturing **grouping** – Ωmega Apr 02 '12 at 01:00
@stackoverflow, I know, that's what I just said. Why? – ikegami Apr 02 '12 at 07:03

score -2 · Answer 4 · answered Apr 01 '12 at 23:46

May I suggest a reading of perlretut (about 4 paragraphs down from here)? It actually covers almost exactly this situation with avoiding overlaps and finding stop codons.
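
perlretut handles this kind of problem with the `\G` anchor and the `/g` modifier. The following is a rough sketch in that spirit, not a copy of the tutorial's example. The variable names are made up for illustration, and the input is the first sequence from the question.

```perl
use strict;
use warnings;

my $seq  = 'ATGGGTAATCCCTAGAAATTT';
my $kept = '';

# In scalar context, /g resumes where the previous match ended, and \G pins
# each match to that position, so every pass consumes exactly one codon.
while ($seq =~ /\G(...)/g) {
    my $codon = $1;
    $kept .= $codon;
    last if $codon =~ /^(?:TAA|TAG|TGA)$/;  # stop after the first in-frame stop codon
}

print "$kept\n";  # ATGGGTAATCCCTAG
```

If the sequence has no in-frame stop codon, the loop ends once fewer than three characters remain, so a trailing partial codon is dropped in this sketch.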

answered Apr 01 '12 at 23:46

MkV
