I have a DNA sequence as my argument.

sequence<-c("ATGAATTTTGATTTA")

I want to count how many times ATG, and each of the other codons, occurs in it. The 64 codons and the amino acids they code for are:

codon <- list(ATA = "I", ATC = "I", ATT = "I", ATG = "M", ACA = "T", 
              ACC = "T", ACG = "T", ACT = "T", AAC = "N", AAT = "N", AAA = "K", 
              AAG = "K", AGC = "S", AGT = "S", AGA = "R", AGG = "R", CTA = "L", 
              CTC = "L", CTG = "L", CTT = "L", CCA = "P", CCC = "P", CCG = "P", 
              CCT = "P", CAC = "H", CAT = "H", CAA = "Q", CAG = "Q", CGA = "R", 
              CGC = "R", CGG = "R", CGT = "R", GTA = "V", GTC = "V", GTG = "V", 
              GTT = "V", GCA = "A", GCC = "A", GCG = "A", GCT = "A", GAC = "D", 
              GAT = "D", GAA = "E", GAG = "E", GGA = "G", GGC = "G", GGG = "G", 
              GGT = "G", TCA = "S", TCC = "S", TCG = "S", TCT = "S", TTC = "F", 
              TTT = "F", TTA = "L", TTG = "L", TAC = "Y", TAT = "Y", TAA = "stop", 
              TAG = "stop", TGC = "C", TGT = "C", TGA = "stop", TGG = "W")

Then I want to calculate what percentage each codon contributes to its amino acid, and get the output in the following format:

codon   count   amino_acids  percentage
CTC     19666       L           0.18
CTT     27340       L           0.13
CTA     31534       L           0.20
CTG     76644       L           0.49

Please help me solve this problem.

  • Your question is impossible to understand for users without a bio background: what's a codon? How is it related to `amino_acids`? Make your question simpler. How did you get a count of 19666 for `CTC` when it's not in the given sequence? – pogibas Jun 10 '18 at 17:44
  • where are you counting "ATG" in? – Onyambu Jun 10 '18 at 17:52
  • DNA is translated into amino acids, which form proteins, the functional key to every bioprocess. A set of three DNA bases is known as a codon, for example ATG or TTT, and each codon codes for a specific amino acid, such as M (methionine). I have given a list showing which codons code for which amino acids. Also, the CTC count of 19666 is hypothetical, just to give you an idea @PoGibas – Mayank Rajput Jun 10 '18 at 18:00
  • I want to check frequency of all the 64 codons not just single ATG @Onyambu – Mayank Rajput Jun 10 '18 at 18:01
  • Where do you check them in? That is the question – Onyambu Jun 10 '18 at 18:12
  • Or can you give an expected result from the list instead of a hypothetical one? – Onyambu Jun 10 '18 at 18:14
  • You failed to point out, for example, that step 1 is to split into 3-letter groups, no overlaps. That is, the GAA starting at the 3rd letter does _not_ count. Correct? – Rick James Jul 07 '18 at 18:34

1 Answer

As long as your codons are aligned, with no frame shifts or gaps, you can split the sequence into triplets and tabulate them:

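sequence <- "ATGAATTTTGATTTAATG"

# split into non-overlapping 3-character codons
# (assumes the sequence length is a multiple of 3)
splitseq <- substring(sequence, seq(1, nchar(sequence) - 1, 3), seq(3, nchar(sequence), 3))

[1] "ATG" "AAT" "TTT" "GAT" "TTA" "ATG"

# tabulate them to get the frequency of each codon
x <- as.data.frame(table(splitseq))

# look up the amino acid for each codon; `codon` is a list, so unlist() it
# first to keep the new column a plain character vector
x$codon <- unlist(codon, use.names = FALSE)[match(x$splitseq, names(codon))]

# fraction of all codons in the sequence
x$percentage <- x$Freq / sum(x$Freq)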

x
  splitseq Freq codon percentage
1      AAT    1     N  0.1666667
2      ATG    2     M  0.3333333
3      GAT    1     D  0.1666667
4      TTA    1     L  0.1666667
5      TTT    1     F  0.1666667
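
Note that the percentage above is each codon's share of all codons in the sequence. The question's example table looks more like each codon's share among the codons for the same amino acid, and it lists all 64 codons. A minimal sketch of that variant follows, assuming the standard genetic code. The names `codon_table`, `res`, and `aa_total` are hypothetical, and the table is rebuilt compactly here only so the block runs on its own; it is meant to match the `codon` list from the question.

```r
# Rebuild the standard codon table compactly (assumption: same content as the
# `codon` list in the question). Codons are generated in T, C, A, G order.
bases <- c("T", "C", "A", "G")
g <- expand.grid(b3 = bases, b2 = bases, b1 = bases, stringsAsFactors = FALSE)
aa <- strsplit("FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG", "")[[1]]
aa[aa == "*"] <- "stop"
codon_table <- setNames(aa, paste0(g$b1, g$b2, g$b3))

sequence <- "ATGAATTTTGATTTAATG"

# Non-overlapping triplets; trailing bases that do not fill a codon are ignored
starts <- 3 * seq_len(nchar(sequence) %/% 3) - 2
splitseq <- substring(sequence, starts, starts + 2)

# Count every one of the 64 codons, including those that never occur
res <- data.frame(
  codon = names(codon_table),
  count = as.vector(table(factor(splitseq, levels = names(codon_table)))),
  amino_acid = unname(codon_table),
  stringsAsFactors = FALSE
)

# Share of each codon among all codons observed for the same amino acid;
# NA when that amino acid does not occur in the sequence at all
aa_total <- ave(res$count, res$amino_acid, FUN = sum)
res$percentage <- ifelse(aa_total > 0, res$count / aa_total, NA)

res <- res[order(res$amino_acid, res$codon), ]
res[res$count > 0, ]
```

With real data, the percentages within each amino acid sum to 1, as in the CTC/CTT/CTA/CTG example in the question.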
Esther
  • It's a great start, but I'm assuming his DNA string isn't that simple since we won't always have an ATG in the beginning. What we should probably do is run `start <- stringr::str_detect(sequence, "ATG")`, then run it again for every stop codon, then run your script for detecting all the amino acids. Hopefully we don't have to deal with the reverse 3'-5' (and then read everything backwards) – A Duv Jun 11 '18 at 05:40
  • Good point, need more info from OP about what the sequences themselves look like, if they've been pre-filtered, etc. – Esther Jun 11 '18 at 05:49
  • I also just found: https://stackoverflow.com/questions/50655313/how-to-find-specific-frequency-of-a-codon?rq=1 – A Duv Jun 11 '18 at 06:19
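
The idea raised in the comments above (find the first ATG, then read in frame up to a stop codon) could be sketched roughly as below. This is only an illustration under assumptions: it looks at the forward strand only, takes the first ATG as the start, and uses a made-up example sequence. It uses base R `regexpr` rather than `stringr::str_detect`, because `str_detect` only reports whether a match exists, not where it is.

```r
# Hypothetical example: two leading bases before the start codon,
# and an in-frame TAG stop codon near the end
sequence <- "CCATGAATTTTGATTTATAGGC"
stop_codons <- c("TAA", "TAG", "TGA")

# Position of the first ATG; regexpr() returns -1 when there is no match
start <- regexpr("ATG", sequence, fixed = TRUE)[1]
stopifnot(start > 0)

# Read non-overlapping triplets in the frame set by that ATG
n_codons <- (nchar(sequence) - start + 1) %/% 3
starts <- start + 3 * (seq_len(n_codons) - 1)
codons <- substring(sequence, starts, starts + 2)

# Keep codons up to, but not including, the first in-frame stop codon
stop_at <- match(TRUE, codons %in% stop_codons)
if (!is.na(stop_at)) codons <- codons[seq_len(stop_at - 1)]

# Frequency of each codon in the open reading frame
table(codons)
```

The resulting `codons` vector can then be tabulated and translated in the same way as `splitseq` in the answer.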