R, join within a range vectorised

Question

I'm trying to join two datasets where a variable (or position along a genome) in one dataset fits within a range in the second (gene start/stop position). However, positions are not unique, but nested within an additional column (chromosome). The same goes for the gene start/stop positions. My goal is to link each position with the corresponding annotation and effect.

For example:

library(sqldf)
set.seed(100)
a <- data.frame(
    annotation = sample(c("this", "that", "other"), 3, replace=TRUE),
    start = seq(1, 30, 10),
    chr = sample(1:3, 3, replace=TRUE)
  )
a$stop <- a$start + 10
b <- data.frame(
    chr = sample(1:3, 3, replace=TRUE),
    position = sample(1:15, 3, replace=TRUE),
    effect = sample(c("high", "low"), 3, replace=TRUE)
  )

An SQL inner join gets me part of the way there:

df<-sqldf("SELECT a.start, a.stop, a.annotation, b.effect, b.position
    FROM a, b
    inner JOIN a b on(b.position >= a.start and b.position <= a.stop);")

But this doesn't account for the repetition of position per chromosome. I'm having conceptual trouble wrapping this into a loop or apply function.

I'm not wedded to SQL, it's just how I tackled a simpler problem previously. I'm also not sure that making an additional index column is appropriate since I have thousands of chromosome values.

My desired output would look like the following:

    df$chr<-c("NA","2","2")
      start stop annotation effect position chr
1     1   11       this   high        3  NA
2     1   11       this   high       10  NA
3    11   21       this    low       14   2

Where each position has been placed between the start and stop points on the correct chr, or given NA where no points on a chr match.

I'm sorry, I missed the set.seed when I re-ran the code. The chr column is dummy data I've appended just to show what I'm after. I appreciate your help. — smm, May 10 '16 at 20:47
@smm sorry but I still don't understand what's going on - the `chr` column in desired result is mysterious - I don't understand its relationship with `a$chr` and `b$chr` — eddi, May 10 '16 at 20:57
thanks @eddi The 'position','start' and 'stop' are nested within the 'chr' number, so it should be common across both datasets. — smm, May 10 '16 at 21:33
@smm what is the origin of e.g. lines 1:3 in `df`, i.e. how do you get them from `a` and `b`? — eddi, May 10 '16 at 21:36
@eddi "df" line 2 takes a "position" of 3 from from "b", locates it between the "start"/"stop" of "a" and links the "annotation" and "effect" information. Fine as it is but I need to repeat this process independently for each "chr" value. Sorry if I'm not making this clear. — smm, May 10 '16 at 21:43
@smm the `chr` column is the part that I don't understand - where are those repeated 3, 1, 2 coming from? — eddi, May 10 '16 at 21:46
@eddi My fault- I just made up the data for the `chr` column. Fixed now. The first three would be NA as the position doesn't map to the correct chromosome. — smm, May 10 '16 at 22:01
@smm why are there 3 identical rows repeated for each position? — eddi, May 10 '16 at 22:26
@eddi I think it's something going wrong with my sql query. I don't need the repeats but haven't been able to get around that yet. — smm, May 10 '16 at 22:59

eddi · Answer 1 · 2016-05-11T22:19:26.530

4

The development version of data.table introduces non-equi joins, allowing for:

library(data.table)
setDT(a) # converting to data.table in place
setDT(b)

b[a, on = .(position >= start, position <= stop), nomatch = 0,
  .(start, stop, annotation, effect, x.position, chr = ifelse(i.chr == x.chr, i.chr, NA))]
#   start stop annotation effect x.position chr
#1:     1   11       this   high          3  NA
#2:     1   11       this   high         10  NA
#3:    11   21       this    low         14   2

edited May 11 '16 at 22:19

answered May 11 '16 at 16:34

eddi

49,088
6
104
155

1

That's an exciting data.table development! – Gregor Thomas May 11 '16 at 16:36
Thanks @eddi, I realise now my desired output was incoherent. This is great though- I'll check out data.table again. Thanks for taking the time on this question- this answer would work if I then subset on the `chr` column. – smm May 11 '16 at 22:15

Gregor Thomas · Accepted Answer · 2016-05-11T23:30:29.310

2

I think this is what you're after:

sqldf(
    "Select start, stop, annotation, effect, position,
    case when a.chr = b.chr then a.chr else NULL end as chr
    from b left join a
    on b.position between a.start and a.stop
    "
)
#   start stop annotation effect position chr
# 1     1   11       this   high        3  NA
# 2     1   11       this   high       10  NA
# 3    11   21       this    low       14   2

edited May 11 '16 at 23:30

answered May 11 '16 at 16:32

Gregor Thomas

136,190
20
167
294

1

this is exactly what I was after. I'll change the `chr` desired output for the sake of future readers. That was painful for you and @eddi, but it's helped me out. I'll make sure my future posts don't require telepathy. – smm May 11 '16 at 22:10

R, join within a range vectorised

2 Answers2

Related