Genome coverage as sliding window

Question

I mapped my reads to my assembly using the bwa mem algorithm and extracted the number of reads per base (= coverage) using samtools depth. The resulting file is the following:

1091900001  1   236
1091900001  2   245
1091900001  3   265
1091900001  4   283
1091900001  5   288
1091900002  1   297
1091900002  2   312
1091900002  3   327
1091900002  4   338
1091900002  5   348

with three columns: name of the contig (since it is a multi-contig file, this ID changes) - position (base) - number of reads that mapped (coverage).

Now I want to calculate the coverage (third column) in sliding windows; in a window size of 3 and slide of 2 as the mean - per contig (first column).

I want to use the rollapply function of the zoo package.

require(zoo)
cov <- read.table("file",header=FALSE, sep="\t", na.strings="NA", dec=".", strip.white=TRUE)
library(reshape) #loads the library to rename the column names
cov<-rename(cov,c(V1="Chr", V2="locus", V3="depth")) #renames the header
rollapply(cov$depth, width = 3, by = 2, FUN = mean, align = "left")

But this of course does not take into account the contig. Plus, my expected output should include the contig-info & the window, it was calculated:

1091900001  1   3   248.6667
1091900001  3   5   278.6667
1091900002  1   3   312.0000
1091900002  3   5   337.6667

Is there an easy way for doing that in R?

cmaher · Accepted Answer · 2018-02-10T23:57:12.957

Here's how you can do this with the dplyr functions group_by and do:

library(dplyr)

cov %>% 
  group_by(Chr) %>% 
  do(
    data.frame(
      window.start = rollapply(.$locus, width=3, by=2, FUN=min, align="left"),
      window.end = rollapply(.$locus, width=3, by=2, FUN=max, align="left"),
      coverage = rollapply(.$depth, width=3, by=2, FUN=mean, align="left")
      )
    )

# # A tibble: 4 x 4
# # Groups:   Chr [2]
#          Chr window.start window.end coverage
#        <int>        <int>      <int>    <dbl>
# 1 1091900001            1          3 248.6667
# 2 1091900001            3          5 278.6667
# 3 1091900002            1          3 312.0000
# 4 1091900002            3          5 337.6667

do allows you to return an arbitrary number of values from grouped operations in the form of a data.frame. In this case, we return a the rolling mean of the coverage value, along with the min and max values from locus from each window.

Edit:

If your dataset is large, you may be better off performing the calculation using data.table. Its syntax is a bit harder to understand if you haven't seen it before, but it can offer substantial speed improvements in grouped operations on bigger data. Here's how your operation works with data.table:

library(data.table)    

setDT(cov)
cov[, .(
      window.start = rollapply(locus, width=3, by=2, FUN=min, align="left"),
      window.end = rollapply(locus, width=3, by=2, FUN=max, align="left"),
      coverage = rollapply(depth, width=3, by=2, FUN=mean, align="left")
      ),
    .(Chr)]

Based on the sample rows you provided, here are the benchmark results for the dplyr and data.table approaches (measured in milliseconds):

# dplyr:
      min       lq     mean   median       uq      max neval
 7.811753 8.685976 10.10268 9.243551 10.42691 144.5274  1000

# data.table:
      min       lq     mean  median       uq      max neval
 1.924472 2.105459 2.510832 2.30479 2.685706 8.848451  1000

So on the sample data, the data.table option is about 4x faster on average.

It is known as the pipe operator -- see an introduction [here](http://seananderson.ca/2014/09/13/dplyr-intro.html). It increases readability, when chaining functions, by allowing you to write `df %>% func1 %>% func2` instead of `func2(func1(df))`. — cmaher, Feb 10 '18 at 17:07
works perfectly, but since my file contains ~4.3 Mio lines, the message `| | 0% ~11 h remaining` pops up — rororo, Feb 10 '18 at 22:49
@rororo please see my edited answer -- I added an alternative approach that will offer improved speed. — cmaher, Feb 10 '18 at 23:58
yes this does speed up things a lot. I just realized that the function does not work if a `Chr` is smaller than the window size. Also, it does not calculate the 'last' window, so the small bit that is left if it is smaller than the window. I.e. if a `Chr` is 5235 long and the window size should be 5000, the last 'window' of 235 is not taken — rororo, Feb 11 '18 at 00:16
@rororo add the argument `partial=TRUE` to all three calls of `rollapply`, as in `rollapply(depth, width=3, by=2, FUN=mean, align="left", partial=TRUE)` — cmaher, Feb 11 '18 at 00:26

Genome coverage as sliding window

1 Answers1

Linked