-2

I am trying to write a bash script that would be able to read DNA sequences (each line in the file is a sequence) from a file, where sequences are separated by an empty line. I am then to find the amino acid that these DNA sequences encode per codon (each group of three literals.) For example, if I have a file with the sequence:

GCATGCTGCGATAACTTTGGCTGAACTTTGGCTGAAGCATGCTGCGAAACTTTGGCTGAACTTTGGCTG

then starting from GCA (first three literals,) I want to decode the DNA into amino acids based on the following table:

Codon(s)                  Amino-acid
TTT,TTC                   Phe
TTA,TTG,CTT,CTC,CTA,CTG   Leu
ATT,ATC,ATA               Ile
ATG                       Met
GTT,GTC,GTA,GTG           Val
TCT,TCC,TCA,TCG           Ser
CCT,CCC,CCA,CCG           Pro
ACT,ACC,ACA,ACG           Thr
GCT,GCC,GCA,GCG           Ala
TAT,TAC                   Tyr
TAA,TAG                   Stop
CAT,CAC                   His
CAA,CAG                   Gln
AAT,AAC                   Asn
AAA,AAG                   Lys
GAT,GAC                   Asp
GAA,GAG                   Glu
TGT,TGC                   Cys
TGA                       Stop
TGG                       Trp
CGT,CGC,CGA,CGG           Arg
AGT,AGC                   Ser
AGA,AGG                   Arg
GGT,GGC,GGA,GGG           Gly

that is, I need to get:

AlaCysCysAspAsnPheGlyStopThrLeuAlaGluAlaCysCysGluThrLeuAlaGluLeuTrpLeu

Then I need to print the name of each Amino Acid and how many times it was used. For example:

Ala: 4
Cys: 4

and so on. I have 100s of files with DNA sequences in them, but I am not that good at bash. I tried awk and tr but I did not know how to code the table into a bash script.

Faiz Lotfy
  • 121
  • 1
  • 11

2 Answers2

0

Well, that was a fun exercise:

#!/usr/bin/perl
use strict;
use warnings;

my %acid_of;
{
    my $raw = <<'***';
TTT,TTC                   Phe
TTA,TTG,CTT,CTC,CTA,CTG   Leu
ATT,ATC,ATA               Ile
ATG                       Met
GTT,GTC,GTA,GTG           Val
TCT,TCC,TCA,TCG           Ser
CCT,CCC,CCA,CCG           Pro
ACT,ACC,ACA,ACG           Thr
GCT,GCC,GCA,GCG           Ala
TAT,TAC                   Tyr
TAA,TAG                   Stop
CAT,CAC                   His
CAA,CAG                   Gln
AAT,AAC                   Asn
AAA,AAG                   Lys
GAT,GAC                   Asp
GAA,GAG                   Glu
TGT,TGC                   Cys
TGA                       Stop
TGG                       Trp
CGT,CGC,CGA,CGG           Arg
AGT,AGC                   Ser
AGA,AGG                   Arg
GGT,GGC,GGA,GGG           Gly
***

    for my $line (split /\n/, $raw) {
        my ($codons, $acid) = split ' ', $line;
        for my $codon (split /,/, $codons) {
            $acid_of{$codon} = $acid;
        }
    }
}

while (my $line = readline) {
    next if $line !~ /\S/;

    my %count;
    $line =~ s{\G([ACGT]{3})}{
        my $acid = $acid_of{$1};
        $count{$acid}++;
        $acid
    }eg;

    for my $acid (sort keys %count) {
        $line .= "$acid: $count{$acid}\n";
    }
} continue {
    print $line;
}
melpomene
  • 84,125
  • 8
  • 85
  • 148
  • Thanks for help. I don't know perl and I was hoping to have a script in bash. Coding the conversion table in a bash script is my problem. – Faiz Lotfy Oct 08 '15 at 07:39
  • @FaizLotfy Why use bash for this? – melpomene Oct 08 '15 at 07:44
  • Because it is the only scripting language I know. I am already studying Python but it is bash that I can use for now. – Faiz Lotfy Oct 08 '15 at 07:48
  • @FaizLotfy You could probably use a loop to repeatedly extract the first three chars of a variable, then use `case`/`esac` to do the substitution. But man, that would be clunky (and probably not very fast). – melpomene Oct 08 '15 at 07:51
  • I will try your suggestion. Thank you again. – Faiz Lotfy Oct 08 '15 at 07:55
0

Another option is to use awk:

$ echo 'TTT,TTC Phe
TTA,TTG,CTT,CTC,CTA,CTG Leu
ATT,ATC,ATA Ile
ATG Met
GTT,GTC,GTA,GTG Val
TCT,TCC,TCA,TCG Ser
CCT,CCC,CCA,CCG Pro
ACT,ACC,ACA,ACG Thr
GCT,GCC,GCA,GCG Ala
TAT,TAC Tyr
TAA,TAG Stop
CAT,CAC His
CAA,CAG Gln
AAT,AAC Asn
AAA,AAG Lys
GAT,GAC Asp
GAA,GAG Glu
TGT,TGC Cys
TGA Stop
TGG Trp
CGT,CGC,CGA,CGG Arg
AGT,AGC Ser
AGA,AGG Arg
GGT,GGC,GGA,GGG Gly'>table
$ echo GCATGCTGCGATAACTTTGGCTGAACTTTGGCTGAAGCATGCTGCGAAACTTTGGCTGAACTTTGGCTG>seq
$ awk 'NR==FNR{split($1,x,",");for(i in x)a[x[i]]=$2;next}{o="";for(i=1;i<length($0);i+=3)o=o a[substr($0,i,3)];print o}' table seq
AlaCysCysAspAsnPheGlyStopThrLeuAlaGluAlaCysCysGluThrLeuAlaGluLeuTrpLeu

You can also look up the standard codon table from an online TSV file:

$ awk 'NR==FNR{a[$1]=$2;next}{o="";for(i=1;i<length($0);i+=3)o=o a[substr($0,i,3)];print o}' <(curl http://homes.di.unimi.it/~re/Corsi/INGM_master_BFG/codon.txt|grep -v \#) seq
AlaCysCysAspAsnPheGlyStpThrLeuAlaGluAlaCysCysGluThrLeuAlaGluLeuTrpLeu

Or this uses single letter amino acid codes with the standard codon table:

$ printf %s\\n AAA:K AAC:N AAG:K AAT:N ACA:T ACC:T ACG:T ACT:T AGA:R AGC:S AGG:R AGT:S ATA:I ATC:I ATG:M ATT:I CAA:Q CAC:H CAG:Q CAT:H CCA:P CCC:P CCG:P CCT:P CGA:R CGC:R CGG:R CGT:R CTA:L CTC:L CTG:L CTT:L GAA:E GAC:D GAG:E GAT:D GCA:A GCC:A GCG:A GCT:A GGA:G GGC:G GGG:G GGT:G GTA:V GTC:V GTG:V GTT:V TAA:\* TAC:Y TAG:\* TAT:Y TCA:S TCC:S TCG:S TCT:S TGA:\* TGC:C TGG:W TGT:C TTA:L TTC:F TTG:L TTT:F|tr : \ >table2
$ awk 'NR==FNR{a[$1]=$2;next}{o="";for(i=1;i<length($0);i+=3)o=o a[substr($0,i,3)];print o}' table2 seq
ACCDNFG*TLAEACCETLAELWL

You can also use seqkit translate (https://bioinf.shenwei.me/seqkit/):

$ brew install seqkit
[...]
$ seqkit translate<<<$'>test\nGCATGCTGCGATAACTTTGGCTGAACTTTGGCTGAAGCATGCTGCGAAACTTTGGCTGAACTTTGGCTG'
>test
ACCDNFG*TLAEACCETLAELWL
$ seqkit translate -l 1 # list standard table (table number 1)
The Standard Code (transl_table=1)
Source: https://www.ncbi.nlm.nih.gov/Taxonomy/taxonomyhome.html/index.cgi?chapter=tgencodes#SG1

Initiation Codons:
  ATG, CTG, TTG

Stop Codons:
  TAA, TAG, TGA

Stranslate Table:
  AAA: K, AAC: N, AAG: K, AAT: N
  ACA: T, ACC: T, ACG: T, ACT: T
  AGA: R, AGC: S, AGG: R, AGT: S
  ATA: I, ATC: I, ATG: M, ATT: I

  CAA: Q, CAC: H, CAG: Q, CAT: H
  CCA: P, CCC: P, CCG: P, CCT: P
  CGA: R, CGC: R, CGG: R, CGT: R
  CTA: L, CTC: L, CTG: L, CTT: L

  GAA: E, GAC: D, GAG: E, GAT: D
  GCA: A, GCC: A, GCG: A, GCT: A
  GGA: G, GGC: G, GGG: G, GGT: G
  GTA: V, GTC: V, GTG: V, GTT: V

  TAA: *, TAC: Y, TAG: *, TAT: Y
  TCA: S, TCC: S, TCG: S, TCT: S
  TGA: *, TGC: C, TGG: W, TGT: C
  TTA: L, TTC: F, TTG: L, TTT: F

Or you can run install.packages("BiocManager");BiocManager::install("Biostrings") in R and then run:

$ Rscript -e 'writeLines(as.character(Biostrings::translate(Biostrings::DNAString("GCATGCTGCGATAACTTTGGCTGAACTTTGGCTGAAGCATGCTGCGAAACTTTGGCTGAACTTTGGCTG"))))'
ACCDNFG*TLAEACCETLAELWL
nisetama
  • 7,764
  • 1
  • 34
  • 21