I would really appreciate it if you could take some of your time to help me with a problem I have.

In R, I would like to redistribute counts from small, contiguous bins in one dataframe into the (non-overlapping) bins of irregular size and spacing in another dataframe, for all overlapping intervals.

My first dataframe looks like this (the actual dataframes would have hundreds of thousands of lines):

chr         bin    from  to     BS_seq_Count
SL4.0ch01   1      1     500    3
SL4.0ch01   2      501   1000   10  
SL4.0ch01   3      1001  1500   3   
SL4.0ch02   1      1     500    3
SL4.0ch02   2      501   1000   10  
SL4.0ch02   3      1001  1500   3   
SL4.0ch03   1      1     500    3
SL4.0ch03   2      501   1000   10  
SL4.0ch03   3      1001  1500   3
... 

And this is the dataframe that I would like to overlap it with and sort into corresponding bins:

chr         bin    from  to      
SL4.0ch01   1      200   700   
SL4.0ch01   2      800   1350  
SL4.0ch02   1      300   400    
SL4.0ch03   1      50    600
SL4.0ch03   2      700   800    
SL4.0ch03   3      1000  1200
...

In the end it should look somewhat like this (decimal or rounded values do not matter much, but the counts from partially overlapping bins should also be assigned proportionally to the new bins):

chr         bin    from  to     count    
SL4.0ch01   1      200   700    5.8
SL4.0ch01   2      800   1350   6.1
SL4.0ch02   1      300   400    0.6
SL4.0ch03   1      50    600    4.7
SL4.0ch03   2      700   800    2
SL4.0ch03   3      1000  1200   1.2
...

I thought of using GenomicRanges with the findOverlaps function, but could not figure out how to get it working correctly in this case.

If anyone has an idea on how to solve this, any help would be greatly appreciated!

Thank you in advance, I wish you a nice weekend and good health!

2 Answers

I'm fairly certain there's a more efficient way with non-equi joins, but here's a way with a data.table cartesian join:

library(data.table)

# Rebuild the example data (chromosome names match the question: SL4.0ch01 to SL4.0ch03)
dt1 <- setorder(data.table(chr = paste0("SL4.0ch0", rep(1:3, each = 3)), bin = rep(1:3, 3), from = rep(c(1, 501, 1001), 3), to = rep(c(500, 1000, 1500), 3), ct = rep(c(3, 10, 3), 3)), chr)
dt2 <- data.table(chr = paste0("SL4.0ch0", c(1,1,2,3,3,3)), bin = c(1,2,1,1,2,3), from = c(200, 800, 300, 50, 700, 1000), to = c(700, 1350, 400, 600, 800, 1200))
# Pair every source bin with every target bin on the same chromosome
dt3 <- merge.data.table(dt1, dt2, by = "chr", allow.cartesian = TRUE)[, overlap := 0]
# Weight each count by the fraction of the source bin covered by the target bin
dt3[from.x < to.y & from.y < to.x, overlap := ct*(pmin(to.x, to.y) - pmax(from.x, from.y))/(to.x - from.x)]
# Sum per target bin; this relies on the groups coming out in the same order as the rows of dt2
dt2[, count := dt3[, .(count = sum(overlap)), by = .(chr, bin.y)]$count]
dt2
#>          chr bin from   to     count
#> 1: SL4.0ch01   1  200  700 5.7915832
#> 2: SL4.0ch01   2  800 1350 6.1062124
#> 3: SL4.0ch02   1  300  400 0.6012024
#> 4: SL4.0ch03   1   50  600 4.6893788
#> 5: SL4.0ch03   2  700  800 2.0040080
#> 6: SL4.0ch03   3 1000 1200 1.1963928
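
For what it's worth, the non-equi join mentioned at the top might look roughly like the sketch below. It is untested against large inputs, the table names bins and targets are my own, and it treats the intervals as inclusive (hence the + 1 in the widths), so the counts come out slightly different from the ones above.

```r
library(data.table)

# Illustrative copies of the question's data (the names `bins` and `targets` are assumptions)
bins <- data.table(
  chr = rep(c("SL4.0ch01", "SL4.0ch02", "SL4.0ch03"), each = 3),
  bin = rep(1:3, 3),
  from = rep(c(1, 501, 1001), 3),
  to = rep(c(500, 1000, 1500), 3),
  BS_seq_Count = rep(c(3, 10, 3), 3)
)
targets <- data.table(
  chr = c("SL4.0ch01", "SL4.0ch01", "SL4.0ch02", "SL4.0ch03", "SL4.0ch03", "SL4.0ch03"),
  bin = c(1, 2, 1, 1, 2, 3),
  from = c(200, 800, 300, 50, 700, 1000),
  to = c(700, 1350, 400, 600, 800, 1200)
)

# Non-equi join: keep only (source bin, target bin) pairs on the same chromosome
# whose intervals intersect, i.e. source from <= target to and source to >= target from.
# x.* refers to columns of `bins`, i.* to columns of `targets`.
pairs <- bins[targets,
  on = .(chr, from <= to, to >= from),
  nomatch = NULL,
  .(chr = i.chr, bin = i.bin, from = i.from, to = i.to,
    part = x.BS_seq_Count *
      (pmin(x.to, i.to) - pmax(x.from, i.from) + 1) / (x.to - x.from + 1))
]

# Sum the proportional contributions per target bin
res <- pairs[, .(count = sum(part)), by = .(chr, bin, from, to)]
res
```

The join only materialises pairs that actually overlap, so it avoids building the full per-chromosome Cartesian product, which should matter with hundreds of thousands of rows.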
jblood94

Here is a solution which uses foverlaps(), the data.table version of the IRanges::findOverlaps() function:

library(data.table)
foverlaps(dt1, setkey(dt2, chr, from, to), nomatch = NULL)[
  , .(count = sum(BS_seq_Count / (i.to - i.from + 1L) * 
                    (pmin(to, i.to) - pmax(from, i.from) + 1L))), 
  by = .(chr, bin, from, to)]
         chr   bin  from    to count
      <char> <int> <int> <int> <num>
1: SL4.0ch01     1   200   700 5.806
2: SL4.0ch01     2   800  1350 6.120
3: SL4.0ch02     1   300   400 0.606
4: SL4.0ch03     1    50   600 4.706
5: SL4.0ch03     2   700   800 2.020
6: SL4.0ch03     3  1000  1200 1.220
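
To see where these numbers come from, the weighting can be checked by hand for a single target bin. The following is only a minimal base R sketch; the helper overlap_count() is a hypothetical name introduced here, and it assumes both intervals are closed (inclusive) integer ranges, which is why the widths carry a + 1.

```r
# Hypothetical helper: the part of a source bin's count that falls inside a
# target interval, assuming closed (inclusive) integer intervals.
overlap_count <- function(src_from, src_to, src_count, tgt_from, tgt_to) {
  # Width of the intersection; clamped at 0 when the intervals do not overlap
  overlap_width <- pmax(0, pmin(src_to, tgt_to) - pmax(src_from, tgt_from) + 1)
  src_count * overlap_width / (src_to - src_from + 1)
}

# Source bins on SL4.0ch01 and the first target interval (200 to 700)
src <- data.frame(from = c(1, 501, 1001), to = c(500, 1000, 1500), count = c(3, 10, 3))
parts <- overlap_count(src$from, src$to, src$count, tgt_from = 200, tgt_to = 700)

parts       # contribution of each source bin: 1.806, 4, 0
sum(parts)  # 5.806
```

The first source bin contributes 301 of its 500 positions (3 * 301 / 500 = 1.806) and the second contributes 200 of 500 (10 * 200 / 500 = 4). Dropping the + 1, i.e. using to - from as the width, gives 3 * 300 / 499 + 10 * 199 / 499, which is about 5.79.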

Data

library(data.table)
dt1 <- fread("
chr         bin    from  to     BS_seq_Count
SL4.0ch01   1      1     500    3
SL4.0ch01   2      501   1000   10  
SL4.0ch01   3      1001  1500   3   
SL4.0ch02   1      1     500    3
SL4.0ch02   2      501   1000   10  
SL4.0ch02   3      1001  1500   3   
SL4.0ch03   1      1     500    3
SL4.0ch03   2      501   1000   10  
SL4.0ch03   3      1001  1500   3")
dt2 <- fread("
chr         bin    from  to      
SL4.0ch01   1      200   700   
SL4.0ch01   2      800   1350  
SL4.0ch02   1      300   400    
SL4.0ch03   1      50    600
SL4.0ch03   2      700   800    
SL4.0ch03   3      1000  1200")
Uwe