1

I have an issue with R, trying to code something for very large tables.

I'm trying to do, for each line in test_cds and for corresponding positions, the sum of the coverage $cov in test_cov table.

For example, for test_cds line 1 :

         seqid source type start end
1 NW_019942502 Gnomon  CDS     1   3

positions between 1 and 3 include, use :

> test_cov

        seqid  pos cov
1 NW_019942502   1  13
2 NW_019942502   2  16
3 NW_019942502   3  20

and do : sum(cov) for pos 1,2,3 = 13+16+20 in order to output :

> test_cds

         seqid source type start end sum_coverage
1 NW_019942502 Gnomon  CDS     1   3           49

Warning : $pos range from 1 to +++ for each $seqid.

Here's my input tables :

> test_cov

        seqid  pos cov
1 NW_019942502   1  13
2 NW_019942502   2  16
3 NW_019942502   3  20
(...)
4 NW_019942502  13  16
5 NW_019942502  14  16
6 NW_019942502  15  18

> test_cds

         seqid source type start end
1 NW_019942502 Gnomon  CDS     1   3   
2 NW_019942502 Gnomon  CDS    13  15   
3 NW_019942502 Gnomon  CDS    17  27  
4 NW_019942503 Gnomon  CDS     1  12   
5 NW_019942503 Gnomon  CDS    67  87   

And expected output :

> test_cds

         seqid source type start end sum_coverage
1 NW_019942502 Gnomon  CDS     1   3           49
2 NW_019942502 Gnomon  CDS    13  15           50

To do so, I'm trying to use something like dplyr to replace a for() loop that will be way too long :

for (i in 1:nrow(test_map)) {
  if (test_cov$seqid == test_cds$seqid & test_cov$pos >= test_cds$start & test_cov$pos <= test_cds$end) {
    test_cds$coverage <- sum(test_cov$cov)
  }
}

Many thanks !

Chloé

  • 2
    You can try non-equi-join in data.table. This answer I think does what you want https://stackoverflow.com/questions/64202974/sum-partial-datatable-as-column-for-another-datatable/ – Ronak Shah Oct 08 '20 at 15:06

1 Answers1

1

Here's a solution with dplyr and tidyr. We seq between start and end then use separate_rows to make the dataframe longer and create a pos column that we can join to test_cov. Once we're joined it's trivial to group_by and summarise.

library(dplyr)
library(tidyr)

newtest_cds <- 
   test_cds %>% 
   rowwise %>% 
   mutate(pos = paste(seq(start, 
                          end), 
                      collapse = ",")) %>%
   tidyr::separate_rows(pos) %>%
   mutate(pos = as.integer(pos))

newtest_cds
#> # A tibble: 50 x 6
#>    seqid        source type  start   end   pos
#>    <chr>        <chr>  <chr> <dbl> <dbl> <int>
#>  1 NW_019942502 Gnomon CDS       1     3     1
#>  2 NW_019942502 Gnomon CDS       1     3     2
#>  3 NW_019942502 Gnomon CDS       1     3     3
#>  4 NW_019942502 Gnomon CDS      13    15    13
#>  5 NW_019942502 Gnomon CDS      13    15    14
#>  6 NW_019942502 Gnomon CDS      13    15    15
#>  7 NW_019942502 Gnomon CDS      17    27    17
#>  8 NW_019942502 Gnomon CDS      17    27    18
#>  9 NW_019942502 Gnomon CDS      17    27    19
#> 10 NW_019942502 Gnomon CDS      17    27    20
#> # … with 40 more rows

cds_cov <- right_join(newtest_cds, test_cov)
#> Joining, by = c("seqid", "pos")

cds_cov %>% 
   group_by(seqid, source, type, start, end) %>%
   summarise(sum_coverage = sum(cov))


#> # A tibble: 2 x 6
#> # Groups:   seqid, source, type, start [2]
#>   seqid        source type  start   end sum_coverage
#>   <chr>        <chr>  <chr> <dbl> <dbl>        <dbl>
#> 1 NW_019942502 Gnomon CDS       1     3           49
#> 2 NW_019942502 Gnomon CDS      13    15           50

Your sample data

test_cds <- readr::read_table2("seqid source type start end
NW_019942502 Gnomon  CDS     1   3
NW_019942502 Gnomon  CDS    13  15
NW_019942502 Gnomon  CDS    17  27
NW_019942503 Gnomon  CDS     1  12
NW_019942503 Gnomon  CDS    67  87")

test_cov <- readr::read_table2("seqid  pos cov
NW_019942502   1  13
NW_019942502   2  16
NW_019942502   3  20
NW_019942502  13  16
NW_019942502  14  16
NW_019942502  15  18")

Chuck P
  • 3,862
  • 3
  • 9
  • 20