
My data frame looks like the one below. col1 is the start of a range when the direction is "+", while col2 is the start of the range when the direction is "-".

library(tidyverse)
df <- tibble(organ=c(rep("liver",5), rep("lung",5)),
             col1=c(1,10,100,40,1000,1,10,100,40,1000), 
             col2=c(15,20,50,80,2000,15,20,50,80,2000), 
             direction=c("+","+","-","+","+","+","+","-","+","+"), 
             score=c(50,100,300,10,300,50,100,300,10,300))

df
#> # A tibble: 10 × 5
#>    organ  col1  col2 direction score
#>    <chr> <dbl> <dbl> <chr>     <dbl>
#>  1 liver     1    15 +            50
#>  2 liver    10    20 +           100
#>  3 liver   100    50 -           300
#>  4 liver    40    80 +            10
#>  5 liver  1000  2000 +           300
#>  6 lung      1    15 +            50
#>  7 lung     10    20 +           100
#>  8 lung    100    50 -           300
#>  9 lung     40    80 +            10
#> 10 lung   1000  2000 +           300

Created on 2022-07-29 by the reprex package (v2.0.1)

For each organ (that is, after group_by(organ)), I want to take the direction of each row into account, identify which rows have overlapping ranges, and then keep only the row with the highest score from each set of overlapping rows.

I want my data to look like this.

#>    organ  col1  col2 direction score
#>    <chr> <dbl> <dbl> <chr>     <dbl>
#>  1 liver    10    20 +           100
#>  3 liver   100    50 -           300
#>  5 liver  1000  2000 +           300
#>  7 lung     10    20 +           100
#>  8 lung    100    50 -           300
#> 10 lung   1000  2000 +           300

I have been thinking about this for a long time. Any guidance or help is highly appreciated.

Henrik
LDT

1 Answer

I would consider using something like IRanges from Bioconductor to manipulate ranges. This answer provides a useful approach.

You can create an IRanges object, using pmin and pmax so that each range always has its lower value as the start and its higher value as the end (an alternative to using direction). findOverlaps then helps you group together the ranges that overlap with each other.

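# Requires Bioconductor
# https://bioconductor.org/install/

library(tidyverse)
library(IRanges)

# Normalise each row to (start, end) regardless of direction
ir <- IRanges(pmin(df$col1, df$col2), pmax(df$col1, df$col2))

# reduce() merges overlapping ranges into clusters; findOverlaps() maps each
# row to its cluster. Note that ir is built from all rows, so the clusters
# are found across organs before the rows are grouped by organ.
df %>%
  group_by(organ, grp = subjectHits(findOverlaps(ir, IRanges::reduce(ir)))) %>%
  slice_max(score) %>%   # keep the highest-scoring row(s) in each cluster
  ungroup() %>%
  select(-grp)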

Output

  organ  col1  col2 direction score
  <chr> <dbl> <dbl> <chr>     <dbl>
1 liver    10    20 +           100
2 liver   100    50 -           300
3 liver  1000  2000 +           300
4 lung     10    20 +           100
5 lung    100    50 -           300
6 lung   1000  2000 +           300
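
One caveat: the overlap clusters above are computed over all rows at once and only then split by organ. With this example it makes no difference because both organs contain the same ranges, but if the organs had different ranges, two non-overlapping ranges in one organ could end up in the same cluster because a range from another organ bridges them. As a rough sketch of a per-organ alternative that needs only dplyr (this is a swapped-in technique, not the IRanges approach; the helper columns start, end and grp are names I made up for illustration), you can sort each organ by start and open a new cluster whenever a range begins after every earlier range has ended:

```r
library(dplyr)

df <- tibble(
  organ     = c(rep("liver", 5), rep("lung", 5)),
  col1      = c(1, 10, 100, 40, 1000, 1, 10, 100, 40, 1000),
  col2      = c(15, 20, 50, 80, 2000, 15, 20, 50, 80, 2000),
  direction = c("+", "+", "-", "+", "+", "+", "+", "-", "+", "+"),
  score     = c(50, 100, 300, 10, 300, 50, 100, 300, 10, 300)
)

result <- df %>%
  # start/end are illustrative helper columns: lower and upper bound per row
  mutate(start = pmin(col1, col2), end = pmax(col1, col2)) %>%
  group_by(organ) %>%
  arrange(start, .by_group = TRUE) %>%
  # Within each organ, a new cluster starts when this range begins after the
  # largest end seen so far. Using `>` means ranges that share an endpoint
  # are treated as overlapping.
  mutate(grp = cumsum(start > lag(cummax(end), default = -Inf))) %>%
  group_by(organ, grp) %>%
  slice_max(score, n = 1) %>%   # ties are kept, as with slice_max(score) above
  ungroup() %>%
  select(-start, -end, -grp)

result
```

For the example data this should keep the same six rows as the IRanges version. If you need exactly one row per cluster even when scores tie, slice_max() accepts with_ties = FALSE.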
Ben