2

I have two large datasets that want to match with each other

library(tidyverse)

df1 <- tibble(position=c(10,11,200,250,300))
df1
#> # A tibble: 5 × 1
#>   position
#>      <dbl>
#> 1       10
#> 2       11
#> 3      200
#> 4      250
#> 5      300

df2 <- tibble(start=c(1,10,200,251),
              end=c(20,100,250,350),
              name=c("geneA","geneB","geneC","geneD"))
df2
#> # A tibble: 4 × 3
#>   start   end name 
#>   <dbl> <dbl> <chr>
#> 1     1    20 geneA
#> 2    10   100 geneB
#> 3   200   250 geneC
#> 4   251   350 geneD

Created on 2022-03-03 by the reprex package (v2.0.1)

I have the position of the genes in the df1 and I want to find based on the range (start-end) from the df2 how many genes can be found in this position.

I want my data to look like this

  position start   end name 
     <dbl> <dbl> <dbl> <chr>
1       10     1    20 geneA
2       10    10   100 geneB
3       11     1    20 geneA
4       11    10   100 geneB
5      200   200   250 geneC
6      250   200   250 geneC
7      300   251   350 geneD

One way to solve this could be through crossing and filtering

df1 %>% 
  crossing(df2) %>% 
  filter(position >= start & position <= end)

However my dataset is way too large and can not afford crossing or expanding. Any other idea?

Waldi
  • 39,242
  • 6
  • 30
  • 78
LDT
  • 2,856
  • 2
  • 15
  • 32

3 Answers3

2

1) SQL engines can perform such operations without crossing. (It may be possible to speed it up even more if you add indexes.)

library(sqldf)

sqldf("select *
  from df1 a
  join df2 b on a.position between b.start and b.end")

2) data.table can also do some sql-like operations. (Be careful because the first variable in each comparison must be from the first data table and the second from the second. They can't be reordered so, for example, the first comparison could not be written as position <= start even though it is mathematically the same.) Again, adding indexes may improve the speed.

library(data.table)

dt1 <- as.data.table(df1)
dt2 <- as.data.table(df2)[, c("start2", "end2") := .(start, end)]
dt2[dt1, on = .(start <= position, end >= position)]
G. Grothendieck
  • 254,981
  • 17
  • 203
  • 341
  • wow ! you have opened to me new horizons! I am so grateful. I have something new to dive in. I appreciate – LDT Mar 03 '22 at 13:22
  • I really like it G. Grothendieck. Its beautiful ! the data.table one is so fast. I am bit concert that the data.table changes the start and end position and looks like this> I am putting the first row ```r start end name 10 10 geneA ``` Do you have any idea on how it can be fixed? – LDT Mar 03 '22 at 16:04
  • 1
    Put copies of start and end in dt2. See revision. – G. Grothendieck Mar 03 '22 at 16:40
1

crossing is a wrapper around expand_grid and does additional stuff e.g. filtering. You can use it directly:

library(tidyverse)

df1 <- tibble(position = c(10, 11, 200, 250, 300))
df1
#> # A tibble: 5 × 1
#>   position
#>      <dbl>
#> 1       10
#> 2       11
#> 3      200
#> 4      250
#> 5      300


df2 <- tibble(
  start = c(1, 10, 200, 251),
  end = c(20, 100, 250, 350),
  name = c("geneA", "geneB", "geneC", "geneD")
)

expand_grid(df1, df2) %>%
  filter(position >= start & position <= end)
#> # A tibble: 7 × 4
#>   position start   end name 
#>      <dbl> <dbl> <dbl> <chr>
#> 1       10     1    20 geneA
#> 2       10    10   100 geneB
#> 3       11     1    20 geneA
#> 4       11    10   100 geneB
#> 5      200   200   250 geneC
#> 6      250   200   250 geneC
#> 7      300   251   350 geneD

Created on 2022-03-03 by the reprex package (v2.0.0)

danlooo
  • 10,067
  • 2
  • 8
  • 22
  • thank you danloo. My data is way too big and I can not expand it this way. The same like crossing. thank you for your time, I have upvoted your answer – LDT Mar 03 '22 at 13:09
  • Way too big is not precise enough to optimize the code. You can filter df1 and / or df2 before expanding to remove unwanted stuff e.g. `df1 %>% filter(position > max(df2$end) | position < min(df2$start))`. Are df1 or df2 grouped? If yes, this will reduce speed drastically. – danlooo Mar 03 '22 at 13:15
  • Thank you danloo for your comment. This is the error that I get "Error: Long vectors are not yet supported. Requested output size must be less than 2147483647". The size of my df2 is 904335 and of my df1 6018 I am not what do you mean by df2 and df1 grouped? Could you explain it further. Thank you for your time and your kind comment – LDT Mar 03 '22 at 13:17
  • `class(df1)` should not contain the phrase "grouped_df". Do you have so many rows? Then shrading might be an option. – danlooo Mar 03 '22 at 13:19
  • 1
    You can just process a smaller chunk of df1 at the same time. Your problem is embarrassingly parallel. – danlooo Mar 03 '22 at 13:24
  • indeed thank you for the kind suggestion – LDT Mar 03 '22 at 13:25
1

Here is a dplyr way (sort of).

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

df1 <- tibble(position = c(10, 11, 200, 250, 300))

df2 <- tibble(
  start = c(1, 10, 200, 251),
  end = c(20, 100, 250, 350),
  name = c("geneA", "geneB", "geneC", "geneD")
)

vbetween <- function(data, col, data2, start, end){
  f <- function(x, l, r) l <= x & x <= r
  col <- enquo(col)
  start <- enquo(start)
  end <- enquo(end)
  x <- data %>% pull(!!col)
  l <- data2 %>% pull(!!start)
  r <- data2 %>% pull(!!end)
  yes <- lapply(x, f, l = l, r = r)
  lapply(yes, \(i) data2[i, ])
}

df1 %>% vbetween(position, df2, start, end) %>% bind_rows()
#> # A tibble: 7 x 3
#>   start   end name 
#>   <dbl> <dbl> <chr>
#> 1     1    20 geneA
#> 2    10   100 geneB
#> 3     1    20 geneA
#> 4    10   100 geneB
#> 5   200   250 geneC
#> 6   200   250 geneC
#> 7   251   350 geneD

Created on 2022-03-03 by the reprex package (v2.0.1)

Rui Barradas
  • 70,273
  • 8
  • 34
  • 66