
I am unable to do a basic non-equi join on two data.tables in R without hitting the error: `argument specifying columns specify non existing column(s): cols[2]='abs(x.val - i.val)'`

A minimal example to reproduce the error:

library(data.table)
set.seed(1); dt1 <- data.table(id = sample(letters[1:3], size = 10, replace = TRUE), val = rnorm(10, mean = 5, sd = 2))
set.seed(2); dt2 <- data.table(id = sample(letters[1:3], size = 10, replace = TRUE), val = rnorm(10, mean = 5, sd = 2))
dt1[dt2, on = .(id, abs(x.val - i.val) < 1), nomatch = 0]
#> Error in colnamesInt(x, names(on), check_dups = FALSE): argument specifying columns specify non existing column(s): cols[2]='abs(x.val - i.val)'

Created on 2022-05-17 by the reprex package (v2.0.1)

It must be something fundamental I am missing.

Note: I have deliberately kept the column names identical in both data.tables to test data.table's x. and i. prefix functionality.

Lazarus Thurston
  • I'm pretty sure you can't use a computed expression in the `on=` argument. You can get *a* result if you break it out so you're just joining on `id` - `dt1[dt2, on = .(id), abs(x.val - i.val) < 1, nomatch = 0, allow.cartesian = TRUE]` - not sure what your intended output is, however. – thelatemail May 17 '22 at 11:16

1 Answer


In two steps, as `on=` only accepts direct column comparisons (`==`, `<=`, `<`, `>=`, `>`), not computed expressions like `abs(x.val - i.val)`:

# equi-join on id first, then filter on the absolute difference
dt1[dt2, on = .(id), nomatch = 0, allow.cartesian = TRUE][abs(val - i.val) < 1]

   id      val    i.val
1:  a 3.142866 3.175512
2:  b 4.421077 3.494669
3:  b 4.401570 3.494669
4:  a 4.988466 4.428577
5:  a 3.142866 2.931430
6:  a 4.988466 4.943710
7:  a 4.988466 4.267357
8:  a 3.142866 2.769503

Or, to avoid the Cartesian intermediate, move the bounds into the join itself. Note that a non-equi join returns the i values in the x join columns, which is why `x.val` is selected explicitly in `j`:

# add the +/-1 bounds to dt2, then join on them directly
dt2[, `:=`(val_min = val - 1, val_max = val + 1)]
dt1[dt2, .(id, x.val, i.val), on = .(id, val > val_min, val < val_max), nomatch = 0]

   id    x.val    i.val
1:  a 3.142866 3.175512
2:  b 4.421077 3.494669
3:  b 4.401570 3.494669
4:  a 4.988466 4.428577
5:  a 3.142866 2.931430
6:  a 4.988466 4.943710
7:  a 4.988466 4.267357
8:  a 3.142866 2.769503
Waldi
  • Your answer works for small data.tables, but when each data.table has tens of thousands of rows and the `id` column matches hundreds of times, the resulting number of rows gets bloated before even reaching the filter in step 2. Is there a way to limit the memory blast? – Lazarus Thurston May 17 '22 at 11:29
  • I would create `val_min := val - 1`, `val_max := val + 1` on one of the tables and then use `foverlaps` on interval `val_min`-`val_max` vs `val`-`val` (see the sketch after these comments). – Waldi May 17 '22 at 11:34
  • Instead of `foverlaps`, what if I use your above suggestion in the non-equi join itself? So we do `on = .(id, val_max > i.val, val_min < i.val)`? – Lazarus Thurston May 17 '22 at 11:42
  • 1
    This works. Just add `id` in the J columns list, and we are ready to go. – Lazarus Thurston May 17 '22 at 12:01
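
A rough sketch of the `foverlaps()` idea from the comments, assuming the same `dt1`/`dt2` as above. `foverlaps()` needs a start/end interval pair in both tables, so the point table gets a zero-width interval via a hypothetical helper column `val2` (not part of the original answer). Note that `foverlaps()` matches closed intervals, so exact boundary ties are kept, unlike the strict `< 1` filter:

# bounds on dt2, which must be keyed on (id, start, end) for foverlaps()
dt2[, `:=`(val_min = val - 1, val_max = val + 1)]
setkey(dt2, id, val_min, val_max)

# dt1 holds points, so give it a zero-width interval [val, val]
dt1[, val2 := val]

foverlaps(dt1, dt2,
          by.x = c("id", "val", "val2"),
          by.y = c("id", "val_min", "val_max"),
          nomatch = 0)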
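
And a minimal sketch of the non-equi-join variant the last two comments converge on, under the same assumptions: the +/-1 bounds are created on `dt1` this time, so they sit on the x side of `on=`, and `id` is added to the `j` columns list as the final comment suggests:

# in on=, the left-hand side of each comparison refers to x (dt1) columns
# and the right-hand side to i (dt2) columns, so the bare "val" is dt2's val
dt1[, `:=`(val_min = val - 1, val_max = val + 1)]
dt1[dt2, .(id, x.val, i.val), on = .(id, val_max > val, val_min < val), nomatch = 0]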