I'm trying to do a non equi self join with data.table on a table that has 2 millions rows and 8 variables. the data look like this:
db table :
product position_min position_max count_pos
A.16 167804 167870 20
A.18 167804 167838 15
A.15 167896 167768 18
A.20 238359 238361 33
A.35 167835 167837 8
dt table:
product_t position_min_t position_max_t count_pos_t
A.16 167804 167870 20
A.18 167804 167838 15
A.15 167896 167768 18
A.20 238359 238361 33
A.35 167835 167837 8
Here's the code I used:
db_join <- db[dt, .(product, product_t, position_min_t, position_max_t, count_pos_t), on = .(position_min <= position_min_t, position_max >= position_max_t)]
I should get:
A16 A18 167804 167838 15
A16 A15 167896 167768 18
A16 A35 167835 167837 8
A18 A35 167835 167837 8
but I keep getting this error
Error in vecseq(f__, len__, if (allow.cartesian || notjoin || !anyDuplicated(f__, :
Join results in more than 2^31 rows (internal vecseq reached physical limit). Very likely misspecified join. Check for duplicate key values in i each of which join to the same group in x over and over again. If that's ok, try by=.EACHI to run j for each group to avoid the large allocation. Otherwise, please search for this error message in the FAQ, Wiki, Stack Overflow and data.table issue tracker for advice.
I did set the allow.cartesian
to TRUE and added by=.EACHI
, it still doesn't work.
I tried the same code on a subset of the table with 1.6 millions rows, and it worked like a charm.
Have you any idea on how to fix it ?
Any help would be much appreciated