1

I have two data.tables that provide sequence coordinates across different chromosomes (categories). For example:

library(data.table)
dt1 <- data.table(chromosome = c("1", "1", "1", "1", "X"),
                  start = c(1, 50, 110, 150, 110),
                  end = c(11, 100, 121, 200, 200))
dt2 <- data.table(chromosome = c("1", "1", "X"),
                  start = c(12, 60, 50),
                  end = c(20, 115, 80))

I need to create a third data.table that provides coordinates for sequences that contain all the integers in dt1 that do not overlap with any integers from the sequences in dt2. For example:

dt3 <- data.table(chromosome = c("1", "1", "1", "1", "X"),
                  start = c(1, 50, 116, 150, 110),
                  end = c(11, 59, 121, 200, 200))

The data.tables I need to run this on are very large and therefore I need to maximise performance. I have tried using the foverlaps() function but to no avail. Any help would be greatly appreciated!

Powege
  • 685
  • 5
  • 12

2 Answers2

4

You can start with something like this from foverlaps

setkey(dt2,chromosome,start,end)
ds = foverlaps(dt1,dt2,  type="any")
ds[,.(chromosome, 
      start = fcase(is.na(start) | i.start <= start,i.start,
                    i.end >= end, end + 1),
      end = fcase(is.na(end) | i.end >= end, i.end,
                  i.start <= start, start - 1)
      )]
#   chromosome start   end
#       <char> <num> <num>
#1:          1     1    11
#2:          1    50    59
#3:          1   116   121
#4:          1   150   200
#5:          X   110   200
Peace Wang
  • 2,399
  • 1
  • 8
  • 15
1

For the sake of completeness, there is a concise solution using the GenomicRanges packages from Bioconductor:

library(GenomicRanges)
setdiff(makeGRangesFromDataFrame(dt1), makeGRangesFromDataFrame(dt2))
GRanges object with 5 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]        1      1-11      *
  [2]        1     50-59      *
  [3]        1   116-121      *
  [4]        1   150-200      *
  [5]        X   110-200      *
  -------
  seqinfo: 2 sequences from an unspecified genome; no seqlengths

In case the result is required to be of class data.table:

library(data.table) # development version 1.14.3 used
library(GenomicRanges)
setdiff(makeGRangesFromDataFrame(dt1), makeGRangesFromDataFrame(dt2)) |> 
  as.data.table() |>
  DT(, .(chromosome = seqnames, start, end))
   chromosome start   end
       <fctr> <int> <int>
1:          1     1    11
2:          1    50    59
3:          1   116   121
4:          1   150   200
5:          X   110   200

As mentioned by Waldi, the GenomicRanges package is not available from CRAN. Waldi has provided the link to the installation guide in the BiocManager vignette. Here is the short version:

install.packages("BiocManager")
BiocManager::install("GenomicRanges")
Uwe
  • 41,420
  • 11
  • 90
  • 134
  • 1
    As `GenomicRanges` isn't on CRAN, this link explains [how to install BioConductor Packages](https://cran.r-project.org/web/packages/BiocManager/vignettes/BiocManager.html) – Waldi Dec 27 '21 at 07:51