I have an issue with R: I'm trying to write code that has to work on very large tables.
For each line in test_cds, I want the sum of the coverage (test_cov$cov) over the corresponding positions in the test_cov table.
For example, for test_cds line 1 :
         seqid source type start end
1 NW_019942502 Gnomon  CDS     1   3
the positions between 1 and 3 (inclusive) correspond to these rows:
> test_cov
         seqid pos cov
1 NW_019942502   1  13
2 NW_019942502   2  16
3 NW_019942502   3  20
and I want sum(cov) for pos 1, 2 and 3, i.e. 13 + 16 + 20 = 49, in order to output:
> test_cds
         seqid source type start end sum_coverage
1 NW_019942502 Gnomon  CDS     1   3           49
Note: pos restarts at 1 for each seqid and can go up to very large values, so positions have to be matched within the same seqid.
Here are my input tables:
> test_cov
         seqid pos cov
1 NW_019942502   1  13
2 NW_019942502   2  16
3 NW_019942502   3  20
(...)
4 NW_019942502  13  16
5 NW_019942502  14  16
6 NW_019942502  15  18
> test_cds
         seqid source type start end
1 NW_019942502 Gnomon  CDS     1   3
2 NW_019942502 Gnomon  CDS    13  15
3 NW_019942502 Gnomon  CDS    17  27
4 NW_019942503 Gnomon  CDS     1  12
5 NW_019942503 Gnomon  CDS    67  87
And the expected output:
> test_cds
         seqid source type start end sum_coverage
1 NW_019942502 Gnomon  CDS     1   3           49
2 NW_019942502 Gnomon  CDS    13  15           50
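To do so, I would like to use something like dplyr instead of a for() loop, which would be far too slow on tables of this size. The loop I have in mind is:
test_cds$sum_coverage <- 0
for (i in seq_len(nrow(test_cds))) {
  # rows of test_cov on the same seqid and inside [start, end] of CDS line i
  in_cds <- test_cov$seqid == test_cds$seqid[i] &
    test_cov$pos >= test_cds$start[i] &
    test_cov$pos <= test_cds$end[i]
  test_cds$sum_coverage[i] <- sum(test_cov$cov[in_cds])
}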
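For reference, here is a minimal loop-free sketch of the same computation. It is written in base R rather than dplyr, and it swaps the row-by-row filter for a running total per seqid looked up with findInterval(). It rests on assumptions that go beyond what is stated above: each pos appears at most once per seqid, and a CDS line with no matching coverage rows gets 0. The helper name sum_interval_coverage is hypothetical, and the toy data leave out the rows hidden behind "(...)".

```r
# Toy data copied from the tables above (rows elided by "(...)" are left out).
test_cov <- data.frame(
  seqid = rep("NW_019942502", 6),
  pos   = c(1, 2, 3, 13, 14, 15),
  cov   = c(13, 16, 20, 16, 16, 18)
)
test_cds <- data.frame(
  seqid  = c(rep("NW_019942502", 3), rep("NW_019942503", 2)),
  source = "Gnomon",
  type   = "CDS",
  start  = c(1, 13, 17, 1, 67),
  end    = c(3, 15, 27, 12, 87)
)

# Hypothetical helper (not from any package).
# Assumes pos is unique within each seqid.
sum_interval_coverage <- function(cds, cov) {
  cov_by_seq <- split(cov[, c("pos", "cov")], as.character(cov$seqid))
  out <- numeric(nrow(cds))  # CDS lines without coverage rows stay at 0
  # One iteration per sequence, not per CDS line.
  for (id in as.character(unique(cds$seqid))) {
    cv <- cov_by_seq[[id]]
    if (is.null(cv)) next    # no coverage rows for this seqid
    rows <- which(cds$seqid == id)
    cv <- cv[order(cv$pos), ]
    # running[k + 1] is the summed coverage of the k smallest positions.
    running <- c(0, cumsum(cv$cov))
    hi <- findInterval(cds$end[rows], cv$pos)        # count of positions <= end
    lo <- findInterval(cds$start[rows] - 1, cv$pos)  # count of positions < start
    out[rows] <- running[hi + 1] - running[lo + 1]
  }
  out
}

test_cds$sum_coverage <- sum_interval_coverage(test_cds, test_cov)
```

On the toy tables this gives 49 and 50 for the first two CDS lines and 0 for the three lines that have no coverage rows in the sample.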
Many thanks !
Chloé