
I have the following vector v:

c("tactagcaatacgcttgcgttcggtggttaagtatgtataatgcgcgggcttgtcgt",
"tgctatcctgacagttgtcacgctgattggtgtcgttacaatctaacgcatcgccaa",
"gtactagagaactagtgcattagcttatttttttgttatcatgctaaccacccggcg")

I'm facing a very upsetting issue here. Each element of this vector is a DNA sequence. What I want to do is split each element into two-letter pairs and count the occurrences of each pair. The desired output for the first element would be exactly this:

AA AC AG AT CA CC CG CT GA GC GG GT TA TC TG TT 
 3  2  2  4  1  0  6  3  0  6  4  7  7  2  5  4 

This result is easily achieved using the function oligonucleotideFrequency. The problem is that this function will not apply over a list or a vector using sapply or lapply, and I don't understand where the problem is or how to fix it.

If I do:

oligonucleotideFrequency(DNAString(v[1]), width = 2)

It works and I get this output:

AA AC AG AT CA CC CG CT GA GC GG GT TA TC TG TT 
 3  2  2  4  1  0  6  3  0  6  4  7  7  2  5  4

But if I do:

v <- DNAString(v)
lapply(v, oligonucleotideFrequency(v, width = 2))

This is what I get:

Error in (function (classes, fdef, mtable)  : 
  unable to find an inherited method for function ‘oligonucleotideFrequency’ for signature ‘"list"’

The same occurs with sapply.

If I check the class of v after applying the DNAString function, it returns "list", so I don't get where the problem is.

Even if I do:

oligonucleotideFrequency(v[1], width = 2)

it returns:

Error in (function (classes, fdef, mtable)  : 
  unable to find an inherited method for function ‘oligonucleotideFrequency’ for signature ‘"list"’

I'm totally lost here, please help. I've spent hours racking my brain over this. How can I fix this problem? I want to apply this function to the whole vector at once.

PS: The R package containing these functions is called Biostrings, and it can be downloaded and installed from here

Thanks in advance

Miguel 2488

2 Answers

x = c("tactagcaatacgcttgcgttcggtggttaagtatgtataatgcgcgggcttgtcgt",
      "tgctatcctgacagttgtcacgctgattggtgtcgttacaatctaacgcatcgccaa",
      "gtactagagaactagtgcattagcttatttttttgttatcatgctaaccacccggcg")

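# All 16 possible two-letter combinations, used as factor levels so zero counts are kept
nc = c("a", "c", "t", "g")
lv = sort(Reduce(paste0, expand.grid(replicate(2, nc, simplify = FALSE))))
# For each sequence, extract every overlapping pair of adjacent letters and tabulate them
lapply(x, function(s)
    table(factor(sapply(seq(2, nchar(s), 1), function(i)
        substring(s, i - 1, i)),
        levels = lv)))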
#[[1]]

#aa ac ag at ca cc cg ct ga gc gg gt ta tc tg tt 
# 3  2  2  4  1  0  6  3  0  6  4  7  7  2  5  4 

#[[2]]

#aa ac ag at ca cc cg ct ga gc gg gt ta tc tg tt 
# 3  4  1  4  5  2  4  4  2  4  1  5  3  5  6  3 

#[[3]]

#aa ac ag at ca cc cg ct ga gc gg gt ta tc tg tt 
# 2  4  4  4  3  3  2  4  2  4  1  3  7  1  3  9 
d.b
  • Thank you d.b. It does the job indeed, but what I was looking for was a way to apply the `oligonucleotideFrequency` function to the whole list – Miguel 2488 Apr 01 '19 at 19:50

There are two ways to use the lapply function.

The first one is to provide a user-defined (anonymous) function and set all the arguments inside that function, as follows.

library(Biostrings)

v <- c("tactagcaatacgcttgcgttcggtggttaagtatgtataatgcgcgggcttgtcgt",
       "tgctatcctgacagttgtcacgctgattggtgtcgttacaatctaacgcatcgccaa",
       "gtactagagaactagtgcattagcttatttttttgttatcatgctaaccacccggcg")


lapply(v, function(x) oligonucleotideFrequency(DNAString(x), width = 2))
# [[1]]
# AA AC AG AT CA CC CG CT GA GC GG GT TA TC TG TT 
# 3  2  2  4  1  0  6  3  0  6  4  7  7  2  5  4 
# 
# [[2]]
# AA AC AG AT CA CC CG CT GA GC GG GT TA TC TG TT 
# 3  4  1  4  5  2  4  4  2  4  1  5  3  5  6  3 
# 
# [[3]]
# AA AC AG AT CA CC CG CT GA GC GG GT TA TC TG TT 
# 2  4  4  4  3  3  2  4  2  4  1  3  7  1  3  9 
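
The call in the question fails for a reason that may be worth spelling out: in `lapply(v, f(v, ...))`, R evaluates `f(v, ...)` on the whole list before lapply ever runs, so the function receives a list rather than a single sequence. The following is a minimal base R sketch of that behaviour; `count_chars` and `seqs` are hypothetical stand-ins invented for illustration, not part of Biostrings.

```r
# Hypothetical stand-in for a function that only accepts a single string
count_chars <- function(s, upper = FALSE) {
  stopifnot(is.character(s), length(s) == 1)
  if (upper) s <- toupper(s)
  table(strsplit(s, "")[[1]])
}

seqs <- list("acca", "ggt")

# Wrong: count_chars(seqs) is evaluated first, on the whole list, and fails
res_wrong <- try(lapply(seqs, count_chars(seqs)), silent = TRUE)

# Right: pass a function; lapply calls it once per element
res_right <- lapply(seqs, function(s) count_chars(s, upper = TRUE))
```

Here `res_wrong` holds the captured error, while `res_right` holds one table per element.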

The second one is to provide the function name and pass any additional arguments through `...`, as follows. With this option, each item in the list (in this case, each element of v) is automatically passed as the first argument of the function.

library(Biostrings)

v <- c("tactagcaatacgcttgcgttcggtggttaagtatgtataatgcgcgggcttgtcgt",
       "tgctatcctgacagttgtcacgctgattggtgtcgttacaatctaacgcatcgccaa",
       "gtactagagaactagtgcattagcttatttttttgttatcatgctaaccacccggcg")

v <- lapply(v, DNAString)

lapply(v, oligonucleotideFrequency, width = 2)
# [[1]]
# AA AC AG AT CA CC CG CT GA GC GG GT TA TC TG TT 
# 3  2  2  4  1  0  6  3  0  6  4  7  7  2  5  4 
# 
# [[2]]
# AA AC AG AT CA CC CG CT GA GC GG GT TA TC TG TT 
# 3  4  1  4  5  2  4  4  2  4  1  5  3  5  6  3 
# 
# [[3]]
# AA AC AG AT CA CC CG CT GA GC GG GT TA TC TG TT 
# 2  4  4  4  3  3  2  4  2  4  1  3  7  1  3  9  
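
Both approaches above loop over the sequences one at a time. As a possible alternative that is not part of the original answer, Biostrings also provides the `DNAStringSet` container, and `oligonucleotideFrequency` has a method for it that returns one row per sequence. This is a minimal sketch, assuming Biostrings is installed; `vset` and `freq` are names chosen here for illustration.

```r
library(Biostrings)

v <- c("tactagcaatacgcttgcgttcggtggttaagtatgtataatgcgcgggcttgtcgt",
       "tgctatcctgacagttgtcacgctgattggtgtcgttacaatctaacgcatcgccaa",
       "gtactagagaactagtgcattagcttatttttttgttatcatgctaaccacccggcg")

# Wrap the whole character vector in a single DNAStringSet
vset <- DNAStringSet(v)

# One call: a matrix with one row per sequence and one column per dinucleotide
freq <- oligonucleotideFrequency(vset, width = 2)

# Total count of each dinucleotide across all sequences
totals <- colSums(freq)
```

Summing the columns should give the per-pair totals over all sequences, which is the kind of result asked about in the comments.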
www
  • Thanks www!! This was exactly what I was looking for!! I must keep learning :) – Miguel 2488 Apr 01 '19 at 19:52
  • Now I have one more question, if I may: imagine I have a dataframe and one of my features is the vector from the main question. I would like each pair of letters in the sequence to become a feature of the dataframe, with its value being the total count of that pair. It would be something like feature 1: AA, feature 2: AC, and so on, and the values would be something like 8 for AA, 10 for AC, and so on for each feature. How could I achieve this? – Miguel 2488 Apr 01 '19 at 19:58
  • @Miguel2488 I think I may know how to do this, but if you don't mind, could you ask a new question with a reproducible example, such as a data frame and the list you want to combine? I will study your reproducible example once you post it. – www Apr 01 '19 at 20:00
  • All right, I'll create a new question and link it to you here when it's done. Give me a minute please :) – Miguel 2488 Apr 01 '19 at 20:01
  • Hi www, here's the [question](https://stackoverflow.com/questions/55462960/how-to-make-the-results-of-a-list-of-counted-values-become-one-hot-like-features). Thanks a lot!! – Miguel 2488 Apr 01 '19 at 20:19