1

I have a GRanges object with some genomic intervals and some metadata (3 vectors with the coverage of each region in 3 different samples). I have applied:

disjoin(my_data)

to obtain a new GRanges object with the smallest set of unique, non-overlapping pieces.

The problem is that I cannot conserve metadata in my new GRanges object. What I would like to obtain is the mean coverage of genomic regions which included this unique set.

As an example, I would like to turn this metadata:

       sample1   sample2   sample3
1:1-3    30        NA         NA
1:1-4    NA        40         35
1:4-5    35        NA         NA
1:5-7    NA        50         50
1:6-7    60        NA         NA 

into this:

       sample1    sample2     sample3
1:1      30         40          35
1:2      30         40          35
1:3      30         40          35
1:4      35         40          35
1:5      35         50          50
1:6      60         50          50
1:7      60         50          50

How can I achieve that?

Jeni
  • 918
  • 7
  • 19

1 Answers1

1

Here is a data.table approach to conserving metadata for the disjoined set of ranges.

library(GRanges)
library(data.table)
data.disjoin <- disjoin(my_data)
overlaps <- as.data.frame(findOverlaps(data.disjoin,data))
coverage.disjoin <- as.data.table(cbind(overlaps,mcols(my_data)[overlaps$subjectHits,]))
coverage.disjoin <- coverage.disjoin[,
                      lapply(.SD[,-1],function(x){unique(x[!is.na(x)])}),
                      by="queryHits"]
mcols(data.disjoin) <- coverage.disjoin[,.(sample1,sample2,sample3)]

First, find the overlaps between the disjoined set of ranges and the original data. Then collect the coverage for the overlaps into a data.table. Find the unique coverage for that range by sample, removing NA values. Note that .SD is a special symbol for the subsetted data.table for the group. Finally, join the result back onto the disjoined data.

Data

my_data <- GRanges(
  c("chr1","chr1","chr1","chr1","chr1")
  ,IRanges(c(1,1,4,5,6),c(3,4,5,7,7)),
  sample1=c(30,NA,35,NA,60),
  sample2=c(NA,40,NA,50,NA),
  sample3=c(NA,35,NA,50,NA))
Ian Campbell
  • 23,484
  • 14
  • 36
  • 57
  • Hi! I have tried this in my real dataset, but I get the following error: `Error in `[.data.table`(coverage.disjoin, , lapply(.SD[, -1], function(x) { : Supplied 2 items for column 40 of group 2 which has 12 rows. The RHS length must either be 1 (single values are ok) or match the LHS length exactly. If you wish to 'recycle' the RHS please use rep() explicitly to make this intent clear to readers of your code. ` – Jeni May 25 '20 at 18:06