Translating a cDNA to amino acids using Perl

Question

So I am trying to translate a complementary strand of DNA to it's respective amino acids. So far I have this code:

#!/usr/bin/perl

open (INFILE, "sumaira2.out");
open (OUTFILE3, ">>sumaira3.out");

%aacode = (
  TTT => "F", TTC => "F", TTA => "L", TTG => "L",
  TCT => "S", TCC => "S", TCA => "S", TCG => "S",
  TAT => "Y", TAC => "Y", TAA => "STOP", TAG => "STOP",
  TGT => "C", TGC => "C", TGA => "STOP", TGG => "W",
  CTT => "L", CTC => "L", CTA => "L", CTG => "L",
  CCT => "P", CCC => "P", CCA => "P", CCG => "P",
  CAT => "H", CAC => "H", CAA => "Q", CAG => "Q",
  CGT => "R", CGC => "R", CGA => "R", CGG => "R",
  ATT => "I", ATC => "I", ATA => "I", ATG => "M",
  ACT => "T", ACC => "T", ACA => "T", ACG => "T",
  AAT => "N", AAC => "N", AAA => "K", AAG => "K",
  AGT => "S", AGC => "S", AGA => "R", AGG => "R",
  GTT => "V", GTC => "V", GTA => "V", GTG => "V",
  GCT => "A", GCC => "A", GCA => "A", GCG => "A",
  GAT => "D", GAC => "D", GAA => "E", GAG => "E",
  GGT => "G", GGC => "G", GGA => "G", GGG => "G",
); # this is the hash table for the amino acids

while ($line=<INFILE>){
  $codon = $codon.$line;
  @array = split "",$codon;
} # splits all the characters in the text

for ($count = 0; $count<scalar@array; $count= $count + 3) {
  $codon = $codon.$array[$count].$array[$count+1].$array[$count+2];
  $aminoacid = $aacode{$codon};
} # tells how to read the codon and execute the hash table

$protein = $protein.$aminoacid; #catenate the string

print OUTFILE3 $protein;

My infile has the reverse complementary DNA already, I am just trying to translate it. For some reason there is nothing in my output. I have no idea what's going wrong since Terminal is not giving me any errors either. Any help would be highly appreciated.

And here's a sample of the file I am trying to translate:

TCGTCGCCTCCCCAACCTAGGTAGTCCGTTGCTGCCCGACGACGGCCGGTAGTCGCCT GCGTCCCTCCTGAAAGGCGTTGGCCGGCAAGCTACGCCGTGGCTACCGGAAGCGCGTCCCCATCAC GCGGTCCTAACTGAACGCGACGGGATGGAGAGTGATCACTCCCCGCCGTCGCGTAGTTCGCCACTC

And it continues on and on for 17 more lines.

Kenosis · Answer 1 · 2014-02-04T04:40:01.453

Perhaps the following will be helpful:

use strict;
use warnings;

my %aacode = (
  TTT => "F", TTC => "F", TTA => "L", TTG => "L",
  TCT => "S", TCC => "S", TCA => "S", TCG => "S",
  TAT => "Y", TAC => "Y", TAA => "STOP", TAG => "STOP",
  TGT => "C", TGC => "C", TGA => "STOP", TGG => "W",
  CTT => "L", CTC => "L", CTA => "L", CTG => "L",
  CCT => "P", CCC => "P", CCA => "P", CCG => "P",
  CAT => "H", CAC => "H", CAA => "Q", CAG => "Q",
  CGT => "R", CGC => "R", CGA => "R", CGG => "R",
  ATT => "I", ATC => "I", ATA => "I", ATG => "M",
  ACT => "T", ACC => "T", ACA => "T", ACG => "T",
  AAT => "N", AAC => "N", AAA => "K", AAG => "K",
  AGT => "S", AGC => "S", AGA => "R", AGG => "R",
  GTT => "V", GTC => "V", GTA => "V", GTG => "V",
  GCT => "A", GCC => "A", GCA => "A", GCG => "A",
  GAT => "D", GAC => "D", GAA => "E", GAG => "E",
  GGT => "G", GGC => "G", GGA => "G", GGG => "G",
); # this is the hash table for the amino acids

my $compDNA = uc do { local $/; <> };
$compDNA =~ s/\s+//g;

my @codons = unpack '(A3)*', $compDNA;
my @aminoAcids = map { exists $aacode{$_} ? $aacode{$_} : "?$_?" } @codons;
print join '', @aminoAcids;

Usage: perl script.pl compDNA_File [>aminoAcid_File]

The last, optional parameter directs output to a file.

First, the entire file is slurped (and converted to all upper-case) into a variable. Next, all whitespace is removed. unpack is used to create a list of three-character elements (codons). map is used to translate the codons to amino acids using the hash you provided. (Note that if there is not a key for the codon, the codon is inserted, enclosed by question marks.) Finally, those amino acids are joined to form a single string, and the result is printed.

Hey guys, I really appreciate all your help. I tried everyone's corrections, however, I still cannot get it work. I think I am going to create a while loop in "$aminoacid = $aacode{$codon}".I'm just not sure how to execute this hash, I'll try different variation with the map and join command! Again thank you for your help! — user3268152, Feb 05 '14 at 03:03
@user3268152 - It may help of you post a sample of your data that you need to translate; add it to your original question. — Kenosis, Feb 05 '14 at 03:05
I have added a few line of my data that needs to be translated and added to my original question. — user3268152, Feb 05 '14 at 15:30
@user3268152 - Using that data, the script generated the following: `SSPPQPRSTOPSVAARRRPVVACVPPERRWPASYAVATGSASPSRGPNSTOPTRRDGESTOPSLPAVASTOPFAT?C?` The `?C?` occurred because "C" wasn't a codon, and the script encloses 'unknown' codons within question marks. Did you execute the script from the command line, as shown in the "Usage:" above? How do you handle "STOP"? — Kenosis, Feb 05 '14 at 18:54

score 0 · Answer 2 · answered Feb 04 '14 at 03:30

Don't you want to put

print OUTFILE3 $protein;

Inside your for loop so you print out every protien you are dealing with, rather than the very last one you have left after your for loop finishes running, like so?

for ($count = 0; $count<scalar@array; $count= $count + 3) {
  $codon = $codon.$array[$count].$array[$count+1].$array[$count+2];
  $aminoacid = $aacode{$codon};

  print OUTFILE3 $aminoacid;

} # tells how to read the codon and execute the hash table

score 0 · Answer 3 · answered Feb 04 '14 at 03:32

Try the script below executed as scriptname < sumaira2.out >> sumaira3.out.
Set $DEBUG to zero to remove debugging output if it works as expected.

#!/usr/bin/perl
use strict; use warnings;

my $DEBUG = 2;

my %aacode = (
TTT => "F", TTC => "F", TTA => "L", TTG => "L",
TCT => "S", TCC => "S", TCA => "S", TCG => "S",
TAT => "Y", TAC => "Y", TAA => "STOP", TAG => "STOP",
TGT => "C", TGC => "C", TGA => "STOP", TGG => "W",
CTT => "L", CTC => "L", CTA => "L", CTG => "L",
CCT => "P", CCC => "P", CCA => "P", CCG => "P",
CAT => "H", CAC => "H", CAA => "Q", CAG => "Q",
CGT => "R", CGC => "R", CGA => "R", CGG => "R",
ATT => "I", ATC => "I", ATA => "I", ATG => "M",
ACT => "T", ACC => "T", ACA => "T", ACG => "T",
AAT => "N", AAC => "N", AAA => "K", AAG => "K",
AGT => "S", AGC => "S", AGA => "R", AGG => "R",
GTT => "V", GTC => "V", GTA => "V", GTG => "V",
GCT => "A", GCC => "A", GCA => "A", GCG => "A",
GAT => "D", GAC => "D", GAA => "E", GAG => "E",
GGT => "G", GGC => "G", GGA => "G", GGG => "G",
); # this is the hash table for the amino acids

my ($codon, $protein)  = ('','');
while (<STDIN>){
  chomp; # remove end of line characters
  s/\s//g; # remove whitespaces
  $codon .= $_;
}

print STDERR "DBG Codon: ", $codon, "\n" if $DEBUG >= 1;

my @aminoacids = ( $codon =~ /(...)/sg );

print STDERR "Aminoacids: ", join(" ", @aminoacids), "\n" if $DEBUG >= 2;

for my $aminoacid (@aminoacids) {
  die "Unknown aminoacid: $aminoacid\n" unless exists $aacode{$aminoacid};
  $protein .=  $aacode{$aminoacid};
}

print STDERR "DBG Protein: ", $protein, "\n" if $DEBUG >= 1;

print $protein, "\n";

Thank you for your help! However, it's still not executing the code correctly. I think I'll have to add a while loop in "$aminoacid = $aacode{$codon}" and play with the map and join command. Thank you again for your help! — user3268152, Feb 05 '14 at 03:04
Could you show on SHORT sequence (3-5 aminoacids) and debug output produced what is wrong? `$codon` string is splitted into `@aminoacids` array by single instruction `my @aminoacids = ( $codon =~ /(...)/sg );` — AnFi, Feb 05 '14 at 07:42

SES · Answer 4 · 2014-02-05T23:14:07.770

I highly recommend using BioPerl to solve these kinds of tasks, or some other library/toolkit. The reason is that in addition to there being 3 reading frames, there are 16 codon tables. In my opinion, people have already spent way too much effort on this problem (I also don't see any correct solutions), and doing anything beyond the nontrivial will require a lot more work and code. Here is a simple example of a translation using the standard codon table.

#!/usr/bin/env perl

use strict;
use warnings;
use Bio::SeqIO;

my $usage = "$0 nt.fasta";
my $file  = shift or die $usage;
my $seqio = Bio::SeqIO->new(-file => $file); 

my $seqobj = $seqio->next_seq;   # create a Bio::Seq object
my $trans  = $seqobj->translate; # call the translate method 
                                 # on the Bio::Seq object

print $trans->seq;               # $trans is a Bio::Seq object, 
                                 # so we call the seq method to get the sequence

You would modify this slightly for mulitple sequences, or to use a different codon table. You can also include a custom codon table. There is a nice tutorial on the BioPerl HOWTO page for translating sequences.

EDIT: The two other solutions I tried do work with a sequence only, but don't parse Fasta format as I was assuming. A major practical consideration is that you should insert a symbol (default is a star for BioPerl, but you can change it to whatever you like) into your translation instead of the word "STOP" because it won't be recognized by any other tools. It's also difficult to discern visually.

score 0 · Answer 5 · answered Feb 07 '14 at 19:00

Ok guys,

So I asked my professor and there were multiple things wrong with my code. First of all I used $codon twice while wanting it to do two different things (I used it once in the while loop and one in the for loop). So it was considering the entire infile as a $codon and then executing the hash table after it. The second thing that was wrong (as someone else mentioned previously) was that $protein was not in the for loop and therefore would just give me the very last amino acid. Anyhow, here is the corrected, functioning code:

open (INFILE, "sumaira2.out");
open (OUTFILE3, ">sumaira3.out");

%aacode = (
TTT => "F", TTC => "F", TTA => "L", TTG => "L",
TCT => "S", TCC => "S", TCA => "S", TCG => "S",
TAT => "Y", TAC => "Y", TAA => "STOP", TAG => "STOP",
TGT => "C", TGC => "C", TGA => "STOP", TGG => "W",
CTT => "L", CTC => "L", CTA => "L", CTG => "L",
CCT => "P", CCC => "P", CCA => "P", CCG => "P",
CAT => "H", CAC => "H", CAA => "Q", CAG => "Q",
CGT => "R", CGC => "R", CGA => "R", CGG => "R",
ATT => "I", ATC => "I", ATA => "I", ATG => "M",
ACT => "T", ACC => "T", ACA => "T", ACG => "T",
AAT => "N", AAC => "N", AAA => "K", AAG => "K",
AGT => "S", AGC => "S", AGA => "R", AGG => "R",
GTT => "V", GTC => "V", GTA => "V", GTG => "V",
GCT => "A", GCC => "A", GCA => "A", GCG => "A",
GAT => "D", GAC => "D", GAA => "E", GAG => "E",
GGT => "G", GGC => "G", GGA => "G", GGG => "G",
); # this is the hash table for the amino acids

while ($line=<INFILE>){
$line =~ s/\s+$//;
$sequence = $sequence.$line;
@array = split "",$sequence;
 } # splits all the characters in the text

for ($count = 0; $count<=scalar @array-3; $count= $count + 3) {
$codon = $array[$count].$array[$count+1].$array[$count+2];
$aminoacid = $aacode{$codon};
$protein = $protein.$aminoacid; #catenate the string

} # tells how to read the codon and execute the hash table


print OUTFILE3 $protein;

Thank you again for everyone's help and sorry it took me so long to get back!

This may work but there are quite a lot things technically wrong with this code. For starters, always put `use strict;` and `use warnings;` at the top of your script and use the [3-argument version of open](http://perldoc.perl.org/functions/open.html). Those things will save you a lot of trouble down the road. Also, I strongly suggest you use BioPerl as I described in my answer (for numerous reasons), unless this is a homework assignment. — SES, Feb 07 '14 at 19:09
Yes, I did notice that every code begins with use strict; use warning;. However, my professor has not introduced me to those commands, and I am not sure if using them for the assignment will be appropriate or not. So my questions, what is the purpose of those commands? As far as Bioperl goes, I don't think I'll be able to use it for this class. But hopefully after I secure a research lab, I will be using BioPerl. Thanks for suggestions! — user3268152, Feb 08 '14 at 15:34
Perl [strict](http://perldoc.perl.org/strict.html) and [warnings](http://perldoc.perl.org/warnings.html) are pragmas that enforce better habits by warning you of unsafe code or halting if there are errors. These are enabled by default with recent versions of Perl, which indicates that it really is necessary to understand (or at least use) these pragmas and use lexical variables. Your code may work without these practices (with some versions of Perl), but it is strongly discouraged. — SES, Feb 08 '14 at 18:59

Translating a cDNA to amino acids using Perl

5 Answers5