R data.table non-equi join based on "not equal"

Question

I would like to merge the data.table to itself based on values where id's are not equal. Here is a small example:

library(data.table)

#Two tables:
dt_1 <- data.table(id = c(1,2,3),x = c(2,3,4))
dt_2 <- copy(dt_1) %>% 
  setnames(c("id_new","x_new"))

#Calculation:
dt_2 <- dt_1[,as.list(dt_2),by = c("id")]
dt_2 <- merge(dt_2,dt_1,by = c("id"),all.x = TRUE)
dt_2[id!=id_new]

Maybe somebody can come up with a better solution.

Please correct me where I'm wrong, but when I run your code, in the final `dt_2` there are nine rows, three of which are true with `id == id_new`. — r2evans, Jun 05 '22 at 13:56
Perhaps `as.data.table(merge.data.frame(dt_1, dt_2, by=c()))[ (id != id_new), ]`? It's a little inefficient since it does a cartesion expansion first and then removes the rows, but the result is likely fairly close to that anyway. — r2evans, Jun 05 '22 at 13:56
@r2evans, this is correct, I can filter them out later. I would be interested to see both with id == id_new and without. — Vitalijs, Jun 05 '22 at 13:57
Why do you say *"on values where id's are not equal"*? That seems specifically counter to having both `==` and `!=`. Perhaps all you need is `as.data.table(merge.data.frame(dt_1, dt_2, by=c()))`. — r2evans, Jun 05 '22 at 13:58
@r2evans, this looks great but I think this would be slow if dt_1 and dt_2 are big tables. Or I am wrong? I was thinking about non-equi join but could not find not equal operator there. — Vitalijs, Jun 05 '22 at 13:58
You say you want where `id == id_new` and where `id != id_new`, and since you have no other join columns, this means you want a true cartesian expansion. Your resulting frame will have `nrow(dt_1) * nrow(dt_2)` rows, period. If your data is large, then you likely need to come up with a different way to do what you ultimately need to do. — r2evans, Jun 05 '22 at 14:00
@r2evans, thanks, the dimensionality problem is clear. Are there other ways to do it? Later I was thinking of adding other conditions for join. This is a toy example. — Vitalijs, Jun 05 '22 at 14:01
I'm interpreting that you need to compare joined rows where both `id == id_new` and `id != id_new`, which means that neither `id` nor `id_new` are join columns. To me, this means you are attempting to do a cartesian-expansion. There is no way to do this without resulting in the multiplicative number of rows, that is its definition. — r2evans, Jun 05 '22 at 14:04
I am absolutely confused on what you want. Your question starts with *"where id's are not equal"*, ergo the code in my second comment. Then you say *"interested to see both with id == id_new and without"*, which defies the first sentence in your question as well as the results after your `dt_2[id!=id_new]`. Finally, I'm trying to confirm the deduction that by needing both `id==id_new` and `id!=id_new`, you are looking to investigate a true expansion without join-keys, meaning that your perceived need is neither `data.table`-efficient nor large-data-feasible. — r2evans, Jun 05 '22 at 14:09
Regardless, even if we dismiss wanting to see `id == id_new` and strictly go with seeing `!=`, that is still a nearly-full expansion, which is still an issue with large data. I suggest you may need to do your exploration one `id` at a time. Good luck! — r2evans, Jun 05 '22 at 14:10
by=.EACHI to compute and pack results during cartesian without materializing it, although not sure if it will apply here to != — jangorecki, Jun 05 '22 at 16:38

score 0 · Answer 1 · answered Jun 06 '22 at 16:49

A few options below with benchmarking. The best seems to be subsetting the 2-permutations of row indices where id is different (f1).

library(data.table)
library(microbenchmark)

# column-bind based on all index pairs where the id values are different
f1 <- function(dt) {
  setnames(
    CJ(1:nrow(dt), 1:nrow(dt))[dt$id[V1] != dt$id[V2], setDT(c(dt[V1], dt[V2]))],
    c(colnames(dt), paste0(colnames(dt), "_new"))
  )
}

# cartesian join, filter during the join
fsub <- function(dt, i) as.list(dt[id != i])
f2 <- function(dt) setnames(dt[, fsub(dt, id), id:x], c(colnames(dt), paste0(colnames(dt), "_new")))

# cartesian join, filter after the join
f3 <- function(dt) setnames(dt[, as.list(dt), id:x], c(colnames(dt), paste0(colnames(dt), "_new")))[id != id_new]

# combining two non-equi joins
f4 <- function(dt) {
  rbindlist(
    list(
      dt[
        dt,
        .(id = x.id, x = x.x, id_new = i.id, x_new = i.x),
        on = .(id < id),
        allow.cartesian = TRUE,
        nomatch = 0
      ],
      dt[
        dt,
        .(id = x.id, x = x.x, id_new = i.id, x_new = i.x),
        on = .(id > id),
        allow.cartesian = TRUE,
        nomatch = 0
      ]
    )
  )
}

# OP data
dt <- data.table(id = 1:3, x = 2:4)
ldt <- list(f1(dt), f2(dt), f3(dt), f3(dt), setorder(f4(dt), id, x, id_new))
identical(ldt[-1], ldt[-length(ldt)])
#> [1] TRUE

# a bigger dataset
dt <- data.table(id = rep(1:25, each = 4), x = 1:100)
ldt <- list(f1(dt), f2(dt), f3(dt), f3(dt), setorder(f4(dt), id, x, id_new))
identical(ldt[-1], ldt[-length(ldt)])
#> [1] TRUE

microbenchmark(f1 = f1(dt),
               f2 = f2(dt),
               f3 = f3(dt),
               f4 = f4(dt))
#> Unit: milliseconds
#>  expr    min      lq      mean  median       uq     max neval
#>    f1 1.1789 1.34895  1.934972 1.56705  2.06485 12.7612   100
#>    f2 8.4337 9.23755 10.398676 9.89295 10.69075 14.6312   100
#>    f3 1.7287 1.96755  2.457754 2.29990  2.66350  6.1247   100
#>    f4 2.1832 2.42430  2.788016 2.58395  2.83860  5.6919   100

# an even bigger one
dt <- data.table(id = rep(1:10, each = 100), x = 1:1000)
ldt <- list(f1(dt), f2(dt), f3(dt), f3(dt), setorder(f4(dt), id, x, id_new))
identical(ldt[-1], ldt[-length(ldt)])
#> [1] TRUE

microbenchmark(f1 = f1(dt),
               f2 = f2(dt),
               f3 = f3(dt),
               f4 = f4(dt))
#> Unit: milliseconds
#>  expr     min        lq      mean    median        uq      max neval
#>    f1 14.3543  16.32115  20.59036  19.07030  22.01220  60.0190   100
#>    f2 93.5400 100.53905 108.82218 104.81745 111.09415 145.3390   100
#>    f3 22.0357  28.98800  36.93655  33.04950  36.53275  80.2040   100
#>    f4 14.8955  18.53780  23.03147  21.26855  24.63430  56.4298   100

R data.table non-equi join based on "not equal"

1 Answers1