2

I am trying to make a function in R which could calculate the frequency of each codon. We know that methionine is an amino acid which could be formed by only one set of codon ATG so its percentage in every set of sequence is 1. Where as Glycine could be formed by GGT, GGC, GGA, GGG hence the percentage of occurring of each codon will be 0.25. The input would be in a DNA sequence like-ATGGGTGGCGGAGGG and with the help of codon table it could calculate the percentage of each occurrence in an input.

please help me by suggesting ways to make this function.

for example, if my argument is ATGTGTTGCTGG then, my result would be

ATG=1
TGT=0.5
TGC=0.5
TGG=1

Data for R:

codon <- list(ATA = "I", ATC = "I", ATT = "I", ATG = "M", ACA = "T", 
    ACC = "T", ACG = "T", ACT = "T", AAC = "N", AAT = "N", AAA = "K", 
    AAG = "K", AGC = "S", AGT = "S", AGA = "R", AGG = "R", CTA = "L", 
    CTC = "L", CTG = "L", CTT = "L", CCA = "P", CCC = "P", CCG = "P", 
    CCT = "P", CAC = "H", CAT = "H", CAA = "Q", CAG = "Q", CGA = "R", 
    CGC = "R", CGG = "R", CGT = "R", GTA = "V", GTC = "V", GTG = "V", 
    GTT = "V", GCA = "A", GCC = "A", GCG = "A", GCT = "A", GAC = "D", 
    GAT = "D", GAA = "E", GAG = "E", GGA = "G", GGC = "G", GGG = "G", 
    GGT = "G", TCA = "S", TCC = "S", TCG = "S", TCT = "S", TTC = "F", 
    TTT = "F", TTA = "L", TTG = "L", TAC = "Y", TAT = "Y", TAA = "stop", 
    TAG = "stop", TGC = "C", TGT = "C", TGA = "stop", TGG = "W")
r2evans
  • 141,215
  • 6
  • 77
  • 149
  • It looks like your sample data is for python. Did you mean to tag that instead of R? Or (based on the text of your question) did you forget to provide an R structure for sample data? – r2evans Jun 02 '18 at 09:18
  • I've suggested an edit that includes an R structure. If that is not desired, my apologies, I can remove it. – r2evans Jun 02 '18 at 09:22
  • Thank you for this edit, sorry for that. – Mayank Singh Rajput Jun 02 '18 at 09:23
  • It isn't clear what the `"C"` for `"TGT"` is supposed to mean, and how this is supposed to know that you're trying to make Glycine meaning 0.25. It might help to spell-out how you calculated `TGT=0.5` (and the others) based on the structure you provided. – r2evans Jun 02 '18 at 09:27
  • TGT and TGC codes for C in DNA sequence hence probability of formation of C by TGC is 50% and for TGT is also 50%. This is same for the rest of the amino acids. This problem is based on codon usage – Mayank Singh Rajput Jun 02 '18 at 09:35
  • Do you mean that because there are two `"C"` within `codon`, that both `TGC` and `TGT` are required to make Glycine? (How the heck am I supposed to know that Glycine corresponds to `"C"`?) Does this mean that four things are needed to make `"A"`: `GCA`, `GCC`, `GCG`, and `GCT`? – r2evans Jun 02 '18 at 10:02
  • Assuming that is the case, then you have two "problems" to solve: (1) translate from `list(TGT="A")` to `list(TGT=0.5)`; and (2) split a string into substrings of length 3, *assuming it is always a perfect multiple of 3*. After that, you have a perfect look-up table. – r2evans Jun 02 '18 at 10:05
  • @r2evans not in a full term. – Mayank Singh Rajput Jun 05 '18 at 10:09

3 Answers3

2

First, I get my lookup list and sequence.

codon <- list(ATA = "I", ATC = "I", ATT = "I", ATG = "M", ACA = "T", 
              ACC = "T", ACG = "T", ACT = "T", AAC = "N", AAT = "N", AAA = "K", 
              AAG = "K", AGC = "S", AGT = "S", AGA = "R", AGG = "R", CTA = "L", 
              CTC = "L", CTG = "L", CTT = "L", CCA = "P", CCC = "P", CCG = "P", 
              CCT = "P", CAC = "H", CAT = "H", CAA = "Q", CAG = "Q", CGA = "R", 
              CGC = "R", CGG = "R", CGT = "R", GTA = "V", GTC = "V", GTG = "V", 
              GTT = "V", GCA = "A", GCC = "A", GCG = "A", GCT = "A", GAC = "D", 
              GAT = "D", GAA = "E", GAG = "E", GGA = "G", GGC = "G", GGG = "G", 
              GGT = "G", TCA = "S", TCC = "S", TCG = "S", TCT = "S", TTC = "F", 
              TTT = "F", TTA = "L", TTG = "L", TAC = "Y", TAT = "Y", TAA = "stop", 
              TAG = "stop", TGC = "C", TGT = "C", TGA = "stop", TGG = "W")

MySeq <- "ATGTGTTGCTGG"

Next, I load the stringi library and break the sequence into chunks of three characters.

# Load library
library(stringi)

# Break into 3 bases
seq_split <- stri_sub(MySeq, seq(1, stri_length(MySeq), by=3), length=3)

Then, I count the letters that these three base chunks correspond to using table.

# Get associated letters
letter_count <- table(unlist(codon[seq_split]))

Finally, I bind the sequences together with the reciprocal of the count and rename my data frame columns.

# Bind into a data frame
res <- data.frame(seq_split,
                  1/letter_count[match(unlist(codon[seq_split]), names(letter_count))])

# Rename columns
colnames(res) <- c("Sequence", "Letter", "Percentage")

#  Sequence Letter Percentage
#1      ATG      M        1.0
#2      TGT      C        0.5
#3      TGC      C        0.5
#4      TGG      W        1.0
Dan
  • 11,370
  • 4
  • 43
  • 68
  • How to get answer like you got? – Mayank Singh Rajput Jun 05 '18 at 07:53
  • @MayankSinghRajput I'm sure I understand your question. You run the code and the answer is stored in the variable `res`. – Dan Jun 05 '18 at 11:03
  • yes sir i got that, but i want to find the percentage of 64 codons only, i mean if there are 900 codons which codes for C alone than it could conclude percentage among that – Mayank Singh Rajput Jun 05 '18 at 13:06
  • @MayankSinghRajput `i want to find the percentage of 64 codons only` If you only want to consider 64 codons, then only use a list with 64 codons (like above). Or do you mean you want to get the percentage of all 64 codons even if the percentage is zero? – Dan Jun 05 '18 at 13:35
  • Yes sir i want exactly what you have said. – Mayank Singh Rajput Jun 06 '18 at 05:56
2

Two things to solve here:

  1. convert codon to the fractions for each letter

    ( fracs <- 1/table(unlist(codon)) )
    #         A         C         D         E         F         G         H         I 
    # 0.2500000 0.5000000 0.5000000 0.5000000 0.5000000 0.2500000 0.5000000 0.3333333 
    #         K         L         M         N         P         Q         R         S 
    # 0.5000000 0.1666667 1.0000000 0.5000000 0.2500000 0.5000000 0.1666667 0.1666667 
    #      stop         T         V         W         Y 
    # 0.3333333 0.2500000 0.2500000 1.0000000 0.5000000 
    codonfracs <- setNames(lapply(codon, function(x) unname(fracs[x])), names(codon))
    str(head(codonfracs))
    # List of 6
    #  $ ATA: num 0.333
    #  $ ATC: num 0.333
    #  $ ATT: num 0.333
    #  $ ATG: num 1
    #  $ ACA: num 0.25
    #  $ ACC: num 0.25
    
  2. convert the sequence string to a vector of length-3 substrings

    s <- 'ATGTGTTGCTGG'
    
    strsplit3 <- function(s, k=3) {
      starts <- seq.int(1, nchar(s), by=k)
      stops <- c(starts[-1] - 1, nchar(s))
      mapply(substr, s, starts, stops, USE.NAMES=FALSE)
    }
    strsplit3(s)
    # [1] "ATG" "TGT" "TGC" "TGG"
    

From here, it's just a lookup:

codonfracs[ strsplit3(s) ]
# $ATG
# [1] 1
# $TGT
# [1] 0.5
# $TGC
# [1] 0.5
# $TGG
# [1] 1

EDIT

Since you want the status of the other codons, try this:

x <- codonfracs
x[ ! names(x) %in% strsplit3(s) ] <- 0
str(x)
# List of 64
#  $ ATA: num 0
#  $ ATC: num 0
#  $ ATT: num 0
#  $ ATG: num 1
#  $ ACA: num 0
#  $ ACC: num 0
#  $ ACG: num 0
# ...snip...
#  $ TAT: num 0
#  $ TAA: num 0
#  $ TAG: num 0
#  $ TGC: num 0.5
#  $ TGT: num 0.5
#  $ TGA: num 0
#  $ TGG: num 1
r2evans
  • 141,215
  • 6
  • 77
  • 149
  • Thank you for this help, but what should i do to get answer for just 64 codons. I want to get the percentage among that 64 codons only – Mayank Singh Rajput Jun 05 '18 at 07:37
  • Also suppose there is no TCC codon in an argument then, i want it will show that TCC=0 – Mayank Singh Rajput Jun 05 '18 at 07:40
  • For TCC=0, just do `res[is.na(res)]<-0` (if `res` is holding the results). "Just 64 codons”, no idea what you're talking about. – r2evans Jun 05 '18 at 13:34
  • Sir for example if we have our argument like ATGATAATCATT then we get a table which says thats ATG codes for M and percentage is ! and ATA, ATC, ATT codes for I and its percentage is 0.3 percent and rest of the 64 lists of codons shows 0 because there is no other codons in an argument – Mayank Singh Rajput Jun 06 '18 at 06:00
  • sir, if my argument contains 1000 to 10k characters and i want to calculate the count of each 64 codons and then check its percentage and get output in a table of this four column(codons, amino acids for which it codes for and count and percentage) then how can i proceed this problem please guide me in this. – Mayank Singh Rajput Jun 08 '18 at 21:09
  • How could i proceed in solving this? – Mayank Singh Rajput Jun 09 '18 at 05:05
  • Mayank, you asked a question, it was answered in three ways. Now you have another question, and though it is related to the first (because to you it is the next problem you must overcome), it is still different enough from the original to warrant [a new question](https://stackoverflow.com/questions/ask). I suggest you start the next question with the `codon` and frequency-finding method you choose (a link to here *plus* the literal values/functions), and then include what you've tried so far to get `table` or `xtabs` (etc) to give you what you need. – r2evans Jun 09 '18 at 05:30
  • Thank you for your suggestion. Also thank you for your efforts – Mayank Singh Rajput Jun 09 '18 at 05:34
2

A slightly different path leads to this solution:

f0 <- function(dna, weight) {
    codons <- regmatches(dna, gregexpr("[ATGC]{3}", dna))
    tibble(id = seq_along(codons), codons = codons) %>%
        unnest() %>%
        mutate(weight = as.vector(wt[codons]))
}

First, codon is just a named vector, not list; here are the weights

codon <- unlist(codon)
weight <- setNames(1 / table(codon)[codon], names(codon))

Second, probably there is a vector of DNA sequences, rather than one.

dna <- c("ATGTGTTGCTGG", "GGTCGTTGTGTA")

To develop the solution, codons can be found by searching for any nucleotide [ACGT] repeated {3} times

codons <- regmatches(dna, gregexpr("[ATGC]{3}", dna))

It seems like it is then convenient to do operations in the tidyverse, creating a tibble (data.frame) where id indicates which sequence the codon is from

library(tidyverse)
tbl <- tibble(id = seq_along(codons), codon = codons) %>% unnest()

and then add the weights

tbl <- mutate(tbl, weight = as.vector(weight[codon]))

so we have

> tbl
# A tibble: 8 x 3
     id codon weight
  <int> <chr>  <dbl>
1     1 ATG    1    
2     1 TGT    0.5  
3     1 TGC    0.5  
4     1 TGG    1    
5     2 GGT    0.25 
6     2 CGT    0.167
7     2 TGT    0.5  
8     2 GTA    0.25 

Standard tidyverse operations could be used for further summary, in particular when the same codon appears multiple times

tbl %>% group_by(id, codon) %>%
    summarize(wt = sum(weight))
Martin Morgan
  • 45,935
  • 7
  • 84
  • 112