2

I am trying to create a function that will return counts of specific adjacent nucleotides (CG beside eachother) within a specific window that I have formatted in a vector.

I would like the windows to be 100 nucleotides long and move shift every 10.

The data is setup like this (to 10k entries):

data <- c("a", "g", "t", "t", "g", "t", "t", "a", "g", "t", "c", "t",
          "a", "c", "g", "t", "g", "g", "a", "c", "c", "g", "a", "c")

So far I have tried this:

library(zoo)
library(seqinr)
rollapply(data, width=100, by=10, FUN=count(data, wordsize=2))

But I always get the error

"Error in match.fun(FUN) : 
'count(data, 2)' is not a function, character or symbol"

I have also tried:

starts <- seq(1, length(data)-100, by = 100)
n <- length(starts)
for (i in 1:n){
    chunk <- data[starts[i]:(starts[i]+99)]
    chunkCG <- count(chunk,wordsize=2)
    print (chunkCG)
}

However, I do not know how to save the data that is returned. This approach also does not allow me to overlap frames.

zielinskipp
  • 120
  • 5
  • 2
    `count(data,wordsize=2)` is not a function. You need `FUN=function(x) count(x, wordsize=2)` probably. Or maybe even `...,FUN=count, wordsize=2)` for your `rollapply` call. – thelatemail Jun 15 '16 at 23:22
  • You want for row 1:100, 101:200, etc. the count of "cg" pairs? – Mike H. Jun 15 '16 at 23:52

2 Answers2

0

Your method doesn't overlap, as you call it with by = 100. Otherwise it looks fine. Just change it to 10.

To extract the data from you last try, try creating character vector that will collect the data and then you can extract the proper count with name indexing.

counted_cg <- vector(mode = "character")

for (i in 1:n){
    chunk <- data[starts[i]:(starts[i]+99)]
    chunkCG <- count(chunk,wordsize=2)
    counted_cg <- c(counted_cg, chunkCG["cg"])
}
zielinskipp
  • 120
  • 5
0

EDIT: To get the desired output with a 10 observation sliding window you can use a for loop. Since we pre-allocate the size of our result vector, the loop is reasonably fast. I think this is the best way to solve your problem since I dont think a lot of grouping (if any) supports a sliding window:

library(data.table)
set.seed(1)
#Sample data
df<-data.frame(var=sample(c("a","g","t","c"),600,replace=T))

#The number of windows you want, shift by 10 each time
n_windows <- ((nrow(df) - 100) / 10) + 1

#Create empty DF, this helps increase speed of below loop
res <- data.frame(window=rep(NA,n_windows),count_cg=rep(NA,n_windows))

#Loop over each i, paste a leaded version of your sequence onto current sequence and count "cg"s
for (i in 1:n_windows){
      res$window[i] <- paste0((i-1)*10 + 1,"-",(i-1)*10 + 100)
      subs <- df[((i-1)*10 + 1):((i-1)*10 + 100),"var"]
      subs2<- paste0(as.character(subs),as.character(shift(subs,1L,type="lead")[1:length(subs) - 1]))
      res$count_cg[i] <- sum(subs2=="cg")
}
   head(res)
  window count_cg
1  1-100       10
2 11-110       10
3 21-120        8
4 31-130        9
5 41-140        9
6 51-150        9
Mike H.
  • 13,960
  • 2
  • 29
  • 39