I have two large datasets that want to match with each other
library(tidyverse)
df1 <- tibble(position=c(10,11,200,250,300))
df1
#> # A tibble: 5 × 1
#> position
#> <dbl>
#> 1 10
#> 2 11
#> 3 200
#> 4 250
#> 5 300
df2 <- tibble(start=c(1,10,200,251),
end=c(20,100,250,350),
name=c("geneA","geneB","geneC","geneD"))
df2
#> # A tibble: 4 × 3
#> start end name
#> <dbl> <dbl> <chr>
#> 1 1 20 geneA
#> 2 10 100 geneB
#> 3 200 250 geneC
#> 4 251 350 geneD
Created on 2022-03-03 by the reprex package (v2.0.1)
I have the position of the genes in the df1 and I want to find based on the range (start-end) from the df2 how many genes can be found in this position.
I want my data to look like this
position start end name
<dbl> <dbl> <dbl> <chr>
1 10 1 20 geneA
2 10 10 100 geneB
3 11 1 20 geneA
4 11 10 100 geneB
5 200 200 250 geneC
6 250 200 250 geneC
7 300 251 350 geneD
One way to solve this could be through crossing and filtering
df1 %>%
crossing(df2) %>%
filter(position >= start & position <= end)
However my dataset is way too large and can not afford crossing or expanding. Any other idea?