I have a large dataset and would like to form all pairs of rows satisfying some condition and then calculate some variables based on which parts of the condition were satisfied. The following MWE illustrates what I would like to achieve:
library(data.table)
set.seed(1234)
IDs <- data.table(id = letters[1:10],
c1 = sample(1:5, 10, replace = T),
c2 = sample(1:5, 10, replace = T),
c3 = sample(1:5, 10, replace = T),
c = 1)
IDs.joined <- IDs[IDs, on = 'c', allow.cartesian = T
][c1 != i.c1 & (c2 == i.c2 | c3 == i.c3) # condition defining which pairs are joined
][, c('Ic2', 'Ic3') := .(c2 == i.c2, c3 == i.c3)
][, overlap_id := fifelse(Ic2 == 1, 2, 3)
][, overlap := Ic2 + Ic3
][, -c('i.c1', 'i.c2', 'i.c3', 'Ic2', 'Ic3')]
The problem is that the full dataset is way too large (~5 million rows) to form the Cartesian join on itself. My question is, is there a way to use data.table
's syntax to perform a conditional join like this directly, without going via the Cartesian join first and imposing the desired condition second?
I have seen similar problems on SO but these can typically be expressed as a rolling join, I am not aware of a way to include X | Y
statements in the rolling join syntax, or X != Y
conditions.