Finding Specific Vector Entries in a Sliding Window

Question

I am trying to create a function that will return counts of specific adjacent nucleotides (CG beside eachother) within a specific window that I have formatted in a vector.

I would like the windows to be 100 nucleotides long and move shift every 10.

The data is setup like this (to 10k entries):

data <- c("a", "g", "t", "t", "g", "t", "t", "a", "g", "t", "c", "t",
          "a", "c", "g", "t", "g", "g", "a", "c", "c", "g", "a", "c")

So far I have tried this:

library(zoo)
library(seqinr)
rollapply(data, width=100, by=10, FUN=count(data, wordsize=2))

But I always get the error

"Error in match.fun(FUN) : 
'count(data, 2)' is not a function, character or symbol"

I have also tried:

starts <- seq(1, length(data)-100, by = 100)
n <- length(starts)
for (i in 1:n){
    chunk <- data[starts[i]:(starts[i]+99)]
    chunkCG <- count(chunk,wordsize=2)
    print (chunkCG)
}

However, I do not know how to save the data that is returned. This approach also does not allow me to overlap frames.

`count(data,wordsize=2)` is not a function. You need `FUN=function(x) count(x, wordsize=2)` probably. Or maybe even `...,FUN=count, wordsize=2)` for your `rollapply` call. — thelatemail, Jun 15 '16 at 23:22
You want for row 1:100, 101:200, etc. the count of "cg" pairs? — Mike H., Jun 15 '16 at 23:52

zielinskipp · Answer 1 · 2016-06-15T23:51:00.900

Your method doesn't overlap, as you call it with by = 100. Otherwise it looks fine. Just change it to 10.

To extract the data from you last try, try creating character vector that will collect the data and then you can extract the proper count with name indexing.

counted_cg <- vector(mode = "character")

for (i in 1:n){
    chunk <- data[starts[i]:(starts[i]+99)]
    chunkCG <- count(chunk,wordsize=2)
    counted_cg <- c(counted_cg, chunkCG["cg"])
}

Mike H. · Accepted Answer · 2016-06-16T23:24:45.277

EDIT: To get the desired output with a 10 observation sliding window you can use a for loop. Since we pre-allocate the size of our result vector, the loop is reasonably fast. I think this is the best way to solve your problem since I dont think a lot of grouping (if any) supports a sliding window:

library(data.table)
set.seed(1)
#Sample data
df<-data.frame(var=sample(c("a","g","t","c"),600,replace=T))

#The number of windows you want, shift by 10 each time
n_windows <- ((nrow(df) - 100) / 10) + 1

#Create empty DF, this helps increase speed of below loop
res <- data.frame(window=rep(NA,n_windows),count_cg=rep(NA,n_windows))

#Loop over each i, paste a leaded version of your sequence onto current sequence and count "cg"s
for (i in 1:n_windows){
      res$window[i] <- paste0((i-1)*10 + 1,"-",(i-1)*10 + 100)
      subs <- df[((i-1)*10 + 1):((i-1)*10 + 100),"var"]
      subs2<- paste0(as.character(subs),as.character(shift(subs,1L,type="lead")[1:length(subs) - 1]))
      res$count_cg[i] <- sum(subs2=="cg")
}
   head(res)
  window count_cg
1  1-100       10
2 11-110       10
3 21-120        8
4 31-130        9
5 41-140        9
6 51-150        9

I actually wanted to count 1:100, 11:110, 21:120 etc. – William Luke Jun 16 '16 at 13:23 — William Luke, Jun 16 '16 at 13:23

Finding Specific Vector Entries in a Sliding Window

2 Answers2