For each value in a column, check if it belongs to any interval in another dataframe

Question

Let's say I have a list of position values:

> head(jap["POS"])
      POS
1  836924
2  922009
3 1036959
4 141607615
5 164000000 
6 118528028 
[...]

And a list of intervals (row 1 holds each gene's start, row 2 its end):

> genes_of_interest
       MGAM        SI      TREH    SLC2A2  SLC2A5   SLC5A1  TAS1R3       LCT
1 141607613 164696686 118528026 170714137 9095166 32439248 1266660 136545420
2 141806547 164796284 118550359 170744539 9148537 32509016 1270694 136594754

For every position in the first data frame, I want to check whether it falls inside any of the intervals in the second data frame.

So in this case, I should get:

FALSE FALSE FALSE TRUE FALSE TRUE

This is because 141607615 belongs to the first interval (MGAM) and 118528028 belongs to the third interval (TREH).

Do you have any idea how to do this?

Thanks in advance.

score 1 · Answer 1 · answered Apr 05 '23 at 15:34

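With dplyr 1.1.0 and up, you can use a non-equi left_join() if you first reshape genes_of_interest into a tidy format (one row per gene, with start and end columns). This should be fast, and it stays flexible if you have other columns to join by as well. Setting multiple = "any" keeps at most one matching interval per position, so the result has one row per position.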

library(dplyr, warn.conflicts = FALSE)
library(tidyr)

jap <- tibble(
  POS = c(836924, 922009, 1036959, 141607615, 164000000, 118528028)
)

genes_of_interest <- tribble(
  ~MGAM, ~SI, ~TREH, ~SLC2A2, ~SLC2A5, ~SLC5A1, ~TAS1R3, ~LCT,
  141607613, 164696686, 118528026, 170714137, 9095166, 32439248, 1266660, 136545420,
  141806547, 164796284, 118550359, 170744539, 9148537, 32509016, 1270694, 136594754
)

# Manipulate `genes_of_interest` into a tidy data format
genes_of_interest <- genes_of_interest %>%
  mutate(bound = c("start", "end")) %>%
  pivot_longer(-bound) %>%
  pivot_wider(names_from = bound, values_from = value) %>%
  mutate(match = TRUE)

genes_of_interest 
#> # A tibble: 8 × 4
#>   name       start       end match
#>   <chr>      <dbl>     <dbl> <lgl>
#> 1 MGAM   141607613 141806547 TRUE 
#> 2 SI     164696686 164796284 TRUE 
#> 3 TREH   118528026 118550359 TRUE 
#> 4 SLC2A2 170714137 170744539 TRUE 
#> 5 SLC2A5   9095166   9148537 TRUE 
#> 6 SLC5A1  32439248  32509016 TRUE 
#> 7 TAS1R3   1266660   1270694 TRUE 
#> 8 LCT    136545420 136594754 TRUE

jap %>%
  left_join(
    genes_of_interest,
    by = join_by(between(POS, start, end)),
    multiple = "any"
  ) %>%
  mutate(match = !is.na(match))
#> # A tibble: 6 × 5
#>         POS name      start       end match
#>       <dbl> <chr>     <dbl>     <dbl> <lgl>
#> 1    836924 <NA>         NA        NA FALSE
#> 2    922009 <NA>         NA        NA FALSE
#> 3   1036959 <NA>         NA        NA FALSE
#> 4 141607615 MGAM  141607613 141806547 TRUE 
#> 5 164000000 <NA>         NA        NA FALSE
#> 6 118528028 TREH  118528026 118550359 TRUE
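
If only the logical vector from the question is needed, the join result can be reduced with pull(). The following is a minimal sketch, assuming dplyr 1.1.0 or later is installed; `intervals` and `in_any` are names introduced here for illustration, and the data is retyped from the question already in tidy form.

```r
library(dplyr, warn.conflicts = FALSE)

# Positions from the question
jap <- tibble(
  POS = c(836924, 922009, 1036959, 141607615, 164000000, 118528028)
)

# Intervals from the question, one row per gene (illustrative name)
intervals <- tibble(
  name  = c("MGAM", "SI", "TREH", "SLC2A2", "SLC2A5", "SLC5A1", "TAS1R3", "LCT"),
  start = c(141607613, 164696686, 118528026, 170714137, 9095166, 32439248, 1266660, 136545420),
  end   = c(141806547, 164796284, 118550359, 170744539, 9148537, 32509016, 1270694, 136594754)
)

# Non-equi join, then keep only the TRUE/FALSE flag as a plain vector
in_any <- jap %>%
  left_join(intervals, by = join_by(between(POS, start, end)), multiple = "any") %>%
  mutate(match = !is.na(name)) %>%
  pull(match)

in_any
#> [1] FALSE FALSE FALSE  TRUE FALSE  TRUE
```

Here the flag is derived from the joined `name` column, so no helper `match` column has to be added to the interval table beforehand.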

score 0 · Accepted Answer · answered Apr 04 '23 at 15:39

We can use sapply() to loop over the columns of genes_of_interest and compare each position in jap with that column's interval. This gives a logical matrix with one row per position and one column per gene. Wrapping it in apply(..., 1, any) then tells us whether any entry in a row is TRUE. Alternatively, we can replace the outer apply() with as.logical(rowSums()); both give the same output.

Note that the between() function comes from the dplyr package, and that its bounds are inclusive.

library(dplyr)

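# [[x]][1] and [[x]][2] pull the start and end as plain numbers,
# which also works if genes_of_interest is a tibble
apply(sapply(seq_along(genes_of_interest), \(x) between(jap$POS, genes_of_interest[[x]][1], genes_of_interest[[x]][2])), 1, any)

# or 

as.logical(rowSums(sapply(seq_along(genes_of_interest), \(x) between(jap$POS, genes_of_interest[[x]][1], genes_of_interest[[x]][2]))))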

Output

[1] FALSE FALSE FALSE  TRUE FALSE  TRUE
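
The same row-wise "any interval" check can also be written without dplyr. This is a sketch in base R using outer(), not the answerer's method; `starts`, `ends`, `inside` and `in_any` are illustrative names, and the data frames are rebuilt from the question so the block runs on its own.

```r
# Data from the question, as plain data frames
jap <- data.frame(
  POS = c(836924, 922009, 1036959, 141607615, 164000000, 118528028)
)
genes_of_interest <- data.frame(
  MGAM   = c(141607613, 141806547),
  SI     = c(164696686, 164796284),
  TREH   = c(118528026, 118550359),
  SLC2A2 = c(170714137, 170744539),
  SLC2A5 = c(9095166, 9148537),
  SLC5A1 = c(32439248, 32509016),
  TAS1R3 = c(1266660, 1270694),
  LCT    = c(136545420, 136594754)
)

# Row 1 holds the interval starts, row 2 the interval ends
starts <- unlist(genes_of_interest[1, ])
ends   <- unlist(genes_of_interest[2, ])

# 6 x 8 logical matrix: is position i inside interval j (bounds inclusive)?
inside <- outer(jap$POS, starts, ">=") & outer(jap$POS, ends, "<=")

# TRUE when a position falls inside at least one interval
in_any <- rowSums(inside) > 0
in_any
#> [1] FALSE FALSE FALSE  TRUE FALSE  TRUE
```

The comparison uses >= and <= so that it matches the inclusive behaviour of between().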

score 0 · Answer 3 · answered Apr 04 '23 at 15:41

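Using matrices:

# Repeat the positions across one column per gene
a <- matrix(jap$POS, nrow(jap), ncol(genes_of_interest))
b <- t(genes_of_interest)
# Repeat each gene's lower and upper bound down the rows
low <- matrix(b[,1], nrow(jap), ncol(genes_of_interest), byrow = TRUE)
up <- matrix(b[,2], nrow(jap), ncol(genes_of_interest), byrow = TRUE)
# Strict comparison: a position equal to a bound does not count as inside
rowSums(a > low & a < up) > 0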
[1] FALSE FALSE FALSE  TRUE FALSE  TRUE
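
One detail worth checking when choosing between these answers is how interval endpoints are treated: the matrix version compares with > and <, while between() includes both bounds. The sketch below illustrates the difference on the MGAM interval from the question; the three test positions are made up for illustration (the two endpoints plus one interior value).

```r
# MGAM interval from the question
lower <- 141607613
upper <- 141806547

# Hypothetical positions: lower bound, an interior value, upper bound
pos <- c(141607613, 141607615, 141806547)

# Strict comparison excludes the endpoints
strict <- pos > lower & pos < upper
strict
#> [1] FALSE  TRUE FALSE

# Inclusive comparison keeps them
inclusive <- pos >= lower & pos <= upper
inclusive
#> [1] TRUE TRUE TRUE
```

For the sample data in the question no position sits exactly on a bound, so both variants return the same result there.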
