
What should I do to calculate the percentage of occurrence of characters in an argument, if the data are:

t <- c("UUU","UUC","UUA","UUG","CUU","CUC","CUA","CUG","AUU","AUC","AUA","AUG","GUU","GUC","GUA","GUG","UCU","UCC","UCA","UCG","CCU","CCC","CCA","CCG","ACU","ACC","ACA","ACG","GCU","GCC","GCA","GCG","UAU","UAC","UAA","UAG","CAU","CAC","CAA","CAG","AAU","AAC","AAA","AAG","GAU","GAC","GAA","GAG","UGU","UGC","UGA","UGG","CGU","CGC","CGA","CGG","AGU","AGC","AGA","AGG","GGU","GGC","GGA","GGG")

I want to write a function for this that I can reuse for similar problems in the future.

Suppose the argument is:

(UUUUUCUUAUUGCUUCUCCUACUGAUUAUCAUAAUGGUUGUCGUAGUGUCUUCCUCAUCGCCUCCCCCACCGACUACCACAACGGCUGCCGCAGCGUAUUACUAAUAGCAUCACCAACAGAAUAACAAAAAGGAUGACGAAGAGUGUUGCUGAUGGCGUCGCCGACGGAGUAGCAGAAGAGGUGGCGGAGGG)

The reading frame starts at the very first base and splits the sequence into groups of 3 (e.g. AUG, GUG). I found the following code, but I want my answer in the form of a list with two columns named count and percentage. Please help me modify this code to give the percentage in the required form.

seqn <- c("UUA","AUC","GUA", "UUA", "GAU", "UUA") #your sequence
l_seq <- length(seqn) 
u_seq <- unique(seqn) 
seq_long <- "UUUAUGGGCG"
seqn <- unlist(str_extract_all(seq_long, pattern = "[AUGC]{3}"))

colSums(sapply(u_seq, function(s) str_count(string = seqn,pattern = s)))/l_seq

Please help me correct this code. I want my argument to be continuous, like UGCUGCUAUGAAUGAUG.

  • Welcome to Stack Overflow! Please take the [tour] and read through the [help], in particular [*How do I ask a good question?*](/help/how-to-ask) Your best bet here is to do your research, [search](/help/searching) for related topics on SO, and give it a go. ***If*** you get stuck and can't get unstuck after doing more research and searching, post a [mcve] of your attempt and say specifically where you're stuck. People will be glad to help. Good luck! – T.J. Crowder May 28 '18 at 12:34
  • Is there any notion of separation of three base pairs in the input string? – Tim Biegeleisen May 28 '18 at 12:39
  • @TimBiegeleisen I didn't understand you, sir. – Mayank Rajput May 29 '18 at 08:57

1 Answer


This might work for you:

library(stringr)
bases <- c("U", "A", "G", "C")
# Count each base within each codon (rows = codons, columns = bases)
sapply(bases, function(b) str_count(string = c("UUA", "AUC", "GUA"), pattern = b))

     U A G C
[1,] 2 1 0 0
[2,] 1 1 0 1
[3,] 1 1 1 0
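If overall base percentages are wanted rather than per-codon counts, the same idea can be sketched in base R. This is only a minimal sketch: the function name `base_percentage` is hypothetical, and it assumes that characters outside `bases` should simply be ignored.

```r
# Hypothetical helper (not part of stringr): count and percentage of each
# base across one continuous sequence.
base_percentage <- function(sequence, bases = c("U", "A", "G", "C")) {
  chars <- strsplit(sequence, "")[[1]]            # split into single characters
  counts <- table(factor(chars, levels = bases))  # other characters are dropped
  data.frame(
    base = names(counts),
    count = as.integer(counts),
    percentage = 100 * as.integer(counts) / sum(counts)
  )
}

base_percentage("UUAAUCGUA")
```

The percentages are taken over the recognised bases only, so they sum to 100 even if the input contains stray characters.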

EDIT: basic genetics

EDIT2: As per your comment, this might work:

seqn <- c("UUA","AUC","GUA", "UUA", "GAU", "UUA") #your sequence
l_seq <- length(seqn) #length of sequence
u_seq <- unique(seqn) #unique codons

# This calculates the fractions of the unique codons in your sequence
colSums(sapply(u_seq, function(s) str_count(string = seqn,pattern = s)))/l_seq

      UUA       AUC       GUA       GAU 
0.5000000 0.1666667 0.1666667 0.1666667 
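To get the two columns named count and percentage that the question asks for, one possible variation swaps the `str_count` approach for base R's `table()`. This is a sketch rather than the only way to do it, and the variable names `counts` and `result` are chosen here for illustration.

```r
seqn <- c("UUA", "AUC", "GUA", "UUA", "GAU", "UUA")  # codons, one per element

counts <- table(seqn)  # occurrences of each unique codon, sorted by name

# One row per codon, with the codon as the row name
result <- data.frame(
  count = as.integer(counts),
  percentage = 100 * as.integer(counts) / length(seqn),
  row.names = names(counts)
)
result
```

Because `table()` compares whole elements, it avoids treating the codon as a regular expression pattern.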

EDIT3: As per your second question, you can split your string into 3-letter codons like so:

seq_long <- "UUUAUGGGCG"
seqn <- unlist(str_extract_all(seq_long, pattern = "[AUGC]{3}"))

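Then run the code from EDIT2, recomputing l_seq and u_seq from the new seqn (they must be calculated after the split, not before). If the length of your sequence is not a multiple of 3, you will lose the trailing one or two letters. You can solve this with some padding.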
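The steps above can be wrapped into a single reusable function. The following is a hedged sketch in base R, not the stringr code shown earlier: the name `codon_usage` is hypothetical, the split is positional (every 3 characters from the first base) instead of using the `[AUGC]{3}` pattern, and trailing bases that do not form a complete codon are dropped rather than padded.

```r
# Hypothetical helper: count and percentage of each codon in a continuous
# sequence, reading in frame from the first base.
codon_usage <- function(sequence) {
  n_codons <- nchar(sequence) %/% 3  # number of complete codons
  if (n_codons == 0) {
    # Fewer than 3 bases: nothing to count
    return(data.frame(count = integer(0), percentage = numeric(0)))
  }
  starts <- seq(1, by = 3, length.out = n_codons)
  codons <- substring(sequence, starts, starts + 2)  # positional split
  counts <- table(codons)
  data.frame(
    count = as.integer(counts),
    percentage = 100 * as.integer(counts) / n_codons,
    row.names = names(counts)
  )
}

codon_usage("UGCUGCUAUGAAUGAUG")
```

Unlike the regular-expression split, this version does not skip characters outside A, U, G and C, so an unexpected character stays inside its codon instead of shifting the reading frame.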

moooh
  • Thank you so much @moooh, but I want to find the percentage of UUT, TUC, etc. together in sets of 3. For example, I want the answer like UUT = 3%, TUC = 5%. – Mayank Rajput May 28 '18 at 12:44
  • @MayankRajput I made an edit that could work for you. – moooh May 28 '18 at 12:54
  • Thank you so much for your help sir. – Mayank Rajput May 28 '18 at 12:58
  • What if I want a continuous argument like UUUAUGGGC? What should I change to do this? – Mayank Rajput May 28 '18 at 15:54
  • @MayankRajput check my 2nd edit – moooh May 28 '18 at 16:36
  • is this code correct?- seq_long <- "UUUAUGGGCG" seqn <- unlist(str_extract_all(seq_long, pattern = "[AUGC]{3}")) l_seq <- length(seqn) #length of sequence u_seq <- unique(unlist(seqn)) #unique codons # This calculates the percentages of the unique codons in your sequence colSums(sapply(u_seq, function(s) str_count(string = seqn,pattern = s)))/1_seq – Mayank Rajput May 29 '18 at 02:14