rolling join to find genomic regions

Question

The Quick version is that I want to find all matches within a certain distance of certain values. Example:

library(data.table)
reads <- structure(list(seqnames = c(1, 1, 1, 1, 1, 1, 1), start = c(100, 
100, 100, 130, 130, 132, 133), end = c(130, 130, 131, 160, 160, 
155, 160), strand = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "+", class = "factor")), row.names = c(NA, 
-7L), class = c("data.table", "data.frame"), .internal.selfref = <pointer: 0x15af570>, sorted = c("start", 
"end", "strand"))

# take only the end position
p3 <- reads[, list(N3 = .N), by = c("seqnames", "end", "strand")]
setnames(p3, "end", "loc.3p")

# take only the start position
p5 <- reads[, list(N5 = .N), by = c("seqnames", "start", "strand")]
setnames(p5, "start", "loc.5p")

# find the start positions close to the end positions
p3[, loc := loc.3p]
p5[, loc := loc.5p]

p3[p5, roll=50, rollends=c(FALSE, TRUE), nomatch=0, on = c("seqnames", "strand", "loc")]

   seqnames loc.3p strand N3 loc loc.5p N5
1:        1    130      +  2 130    130  2
2:        1    131      +  1 132    132  1
3:        1    131      +  1 133    133  1

This works, but only the first match is returned. My goal is now to find all matches (within the specified range of 50). Expected output:

   seqnames loc.3p strand N3 loc loc.5p N5
1:        1    130      +  2 130    130  2
2:        1    130      +  2 130    132  1
3:        1    130      +  2 133    133  1
4:        1    131      +  1 132    132  1
4:        1    131      +  1 133    133  1

Notice how loc.3p 130 should also have a match with loc.5p 132 and 132.

How to do that? I need to keep the scores for the next set of operations.

Bioinformatics version, I am trying to find all downstream 5' reads up to a certain distance (in the same strand). For this example I am only using the "+" strand, but I will also have to do it for "-" strand. Since this will be done for millions of reads, data.table seems to be appropriate. I looked into GenomicRanges but keeping the metadata for both sets of data (5' and 3' positions) is a tad convoluted.

zack · Accepted Answer · 2018-12-11T23:11:42.883

2

When you say

within the specified range of 50

It makes me think non-equi join. In the future it may be helpful if you show your desired result :)

slight modification to p3, note I skipped creating the loc column that you had.

p3[, "upper" := 50 + loc.3p]

Non-equi join:

p3[p5, .(seqnames, x.loc.3p, strand,  N3, loc.5p, N5),
   on = .(seqnames, strand, loc.3p <= loc.5p, upper >= loc.5p), nomatch = 0
   ][order(x.loc.3p, loc.5p),
     .(seqnames, loc.3p = x.loc.3p, strand, N3, loc.5p, N5)
     ]

   seqnames loc.3p strand N3 loc.5p N5
1:        1    130      +  2    130  2
2:        1    130      +  2    132  1
3:        1    130      +  2    133  1
4:        1    131      +  1    132  1
5:        1    131      +  1    133  1

The last bracket is for ordering and renaming, I'm not entirely sure why I had to specify x.loc.3p in the first brackets, but I got an error with out it...

edited Dec 11 '18 at 23:11

answered Dec 11 '18 at 16:04

zack

5,205
1
19
25

_In the future it may be helpful if you show your desired result :)_ You are absolutely right and that has been added to the post now. Your answer will probably solve the issue with some minor changes - the lower bound is not needed. I will test it further and accept the answer if all systems are go :) – fridaymeetssunday Dec 11 '18 at 19:17
No worries - I think I misunderstood what you meant by "specified range of 50", I imagined that meant `loc.5p` had to be within 50 units of `loc.3p` in order to be joined. Happy to contribute further if it's explained. – zack Dec 11 '18 at 19:27
All good. It is not clear that I meant downstream, or rather that 5p should have the same value or at most 50 more than 3p. – fridaymeetssunday Dec 11 '18 at 19:29
Great - my lack of genomics knowledge probably hurt me there. I've changed the call to get what I believe you're looking for. – zack Dec 11 '18 at 19:52
So, it seems to have mostly worked. A couple things had to be changed (i) `50L + loc.3p` because one of the matching columns is an integer and it was throwing an error, and (ii) there was a large allocation error, extreme multiple matches, which was solved with `allow.cartesian=TRUE`. It should be noted that the example in OP is very simplified. The actual data has a few million rows in both tables. – fridaymeetssunday Dec 11 '18 at 21:46

rolling join to find genomic regions

1 Answers1