1

I mapped my reads to my assembly using the bwa mem algorithm and extracted the number of reads per base (= coverage) using samtools depth. The resulting file is the following:

1091900001  1   236
1091900001  2   245
1091900001  3   265
1091900001  4   283
1091900001  5   288
1091900002  1   297
1091900002  2   312
1091900002  3   327
1091900002  4   338
1091900002  5   348

with three columns: name of the contig (since it is a multi-contig file, this ID changes) - position (base) - number of reads that mapped (coverage).

Now I want to calculate the coverage (third column) in sliding windows; in a window size of 3 and slide of 2 as the mean - per contig (first column).

I want to use the rollapply function of the zoo package.

require(zoo)
cov <- read.table("file",header=FALSE, sep="\t", na.strings="NA", dec=".", strip.white=TRUE)
library(reshape) #loads the library to rename the column names
cov<-rename(cov,c(V1="Chr", V2="locus", V3="depth")) #renames the header
rollapply(cov$depth, width = 3, by = 2, FUN = mean, align = "left")

But this of course does not take into account the contig. Plus, my expected output should include the contig-info & the window, it was calculated:

1091900001  1   3   248.6667
1091900001  3   5   278.6667
1091900002  1   3   312.0000
1091900002  3   5   337.6667

Is there an easy way for doing that in R?

rororo
  • 815
  • 16
  • 31

1 Answers1

1

Here's how you can do this with the dplyr functions group_by and do:

library(dplyr)

cov %>% 
  group_by(Chr) %>% 
  do(
    data.frame(
      window.start = rollapply(.$locus, width=3, by=2, FUN=min, align="left"),
      window.end = rollapply(.$locus, width=3, by=2, FUN=max, align="left"),
      coverage = rollapply(.$depth, width=3, by=2, FUN=mean, align="left")
      )
    )

# # A tibble: 4 x 4
# # Groups:   Chr [2]
#          Chr window.start window.end coverage
#        <int>        <int>      <int>    <dbl>
# 1 1091900001            1          3 248.6667
# 2 1091900001            3          5 278.6667
# 3 1091900002            1          3 312.0000
# 4 1091900002            3          5 337.6667

do allows you to return an arbitrary number of values from grouped operations in the form of a data.frame. In this case, we return a the rolling mean of the coverage value, along with the min and max values from locus from each window.

Edit:

If your dataset is large, you may be better off performing the calculation using data.table. Its syntax is a bit harder to understand if you haven't seen it before, but it can offer substantial speed improvements in grouped operations on bigger data. Here's how your operation works with data.table:

library(data.table)    

setDT(cov)
cov[, .(
      window.start = rollapply(locus, width=3, by=2, FUN=min, align="left"),
      window.end = rollapply(locus, width=3, by=2, FUN=max, align="left"),
      coverage = rollapply(depth, width=3, by=2, FUN=mean, align="left")
      ),
    .(Chr)]

Based on the sample rows you provided, here are the benchmark results for the dplyr and data.table approaches (measured in milliseconds):

# dplyr:
      min       lq     mean   median       uq      max neval
 7.811753 8.685976 10.10268 9.243551 10.42691 144.5274  1000

# data.table:
      min       lq     mean  median       uq      max neval
 1.924472 2.105459 2.510832 2.30479 2.685706 8.848451  1000

So on the sample data, the data.table option is about 4x faster on average.

cmaher
  • 5,100
  • 1
  • 22
  • 34
  • what does `%>%` mean? I've never seehn this before – rororo Feb 10 '18 at 17:04
  • It is known as the pipe operator -- see an introduction [here](http://seananderson.ca/2014/09/13/dplyr-intro.html). It increases readability, when chaining functions, by allowing you to write `df %>% func1 %>% func2` instead of `func2(func1(df))`. – cmaher Feb 10 '18 at 17:07
  • works perfectly, but since my file contains ~4.3 Mio lines, the message `| | 0% ~11 h remaining` pops up – rororo Feb 10 '18 at 22:49
  • @rororo please see my edited answer -- I added an alternative approach that will offer improved speed. – cmaher Feb 10 '18 at 23:58
  • yes this does speed up things a lot. I just realized that the function does not work if a `Chr` is smaller than the window size. Also, it does not calculate the 'last' window, so the small bit that is left if it is smaller than the window. I.e. if a `Chr` is 5235 long and the window size should be 5000, the last 'window' of 235 is not taken – rororo Feb 11 '18 at 00:16
  • @rororo add the argument `partial=TRUE` to all three calls of `rollapply`, as in `rollapply(depth, width=3, by=2, FUN=mean, align="left", partial=TRUE)` – cmaher Feb 11 '18 at 00:26
  • ha! didn't think that it was that easy. Many thanks! – rororo Feb 11 '18 at 10:05