
I'm trying to do a non-equi self join with data.table on a table that has 2 million rows and 8 variables. The data look like this:

db table:
product     position_min    position_max    count_pos
A.16        167804          167870          20
A.18        167804          167838          15
A.15        167896          167768          18
A.20        238359          238361          33
A.35        167835          167837          8

dt table:
product_t   position_min_t   position_max_t   count_pos_t
A.16        167804           167870           20
A.18        167804           167838           15
A.15        167896           167768           18
A.20        238359           238361           33
A.35        167835           167837           8
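For anyone who wants to reproduce this, the two sample tables above can be rebuilt like so (a minimal sketch with just the four columns shown; the real tables have 8 variables and ~2 million rows):

```r
library(data.table)

# Sample of db (the full table has ~2 million rows and 8 variables)
db <- data.table(
  product      = c("A.16", "A.18", "A.15", "A.20", "A.35"),
  position_min = c(167804L, 167804L, 167896L, 238359L, 167835L),
  position_max = c(167870L, 167838L, 167768L, 238361L, 167837L),
  count_pos    = c(20L, 15L, 18L, 33L, 8L)
)

# dt holds the same rows under suffixed names, for the self join
dt <- data.table(
  product_t      = db$product,
  position_min_t = db$position_min,
  position_max_t = db$position_max,
  count_pos_t    = db$count_pos
)
```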

Here's the code I used:

db_join <- db[dt,
              .(product, product_t, position_min_t, position_max_t, count_pos_t),
              on = .(position_min <= position_min_t, position_max >= position_max_t)]

I should get:

product     product_t   position_min_t   position_max_t   count_pos_t
A.16        A.18        167804           167838           15
A.16        A.15        167896           167768           18
A.16        A.35        167835           167837           8
A.18        A.35        167835           167837           8

but I keep getting this error:

Error in vecseq(f__, len__, if (allow.cartesian || notjoin || !anyDuplicated(f__,  : 
  Join results in more than 2^31 rows (internal vecseq reached physical limit). Very likely misspecified join. Check for duplicate key values in i each of which join to the same group in x over and over again. If that's ok, try by=.EACHI to run j for each group to avoid the large allocation. Otherwise, please search for this error message in the FAQ, Wiki, Stack Overflow and data.table issue tracker for advice.

I did set `allow.cartesian` to `TRUE` and added `by=.EACHI`, but it still doesn't work. I tried the same code on a subset of the table with 1.6 million rows, and it worked like a charm. Do you have any idea how to fix it? Any help would be much appreciated.
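For reference, the two variants I tried look roughly like this (a sketch; both still fail on the full table):

```r
# Variant 1: explicitly allow the explosive row count
db_join <- db[dt,
              .(product, product_t, position_min_t, position_max_t, count_pos_t),
              on = .(position_min <= position_min_t, position_max >= position_max_t),
              allow.cartesian = TRUE]

# Variant 2: run j once per row of dt to avoid one huge allocation,
# as the error message suggests
db_join <- db[dt,
              .(product, product_t, position_min_t, position_max_t, count_pos_t),
              on = .(position_min <= position_min_t, position_max >= position_max_t),
              by = .EACHI]
```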

Stella
  • Setting `allow.cartesian=TRUE` does not fix too-large joins; it is just a mechanism for you to tell `data.table` that you are okay with "explosive row counts". When you ran it on the 1.6M-row input, how many rows did you get out? Is there something about the removed 400K rows that would significantly change the join geometry? – r2evans Jun 17 '21 at 11:45
  • Thank you for your response @r2evans. I get 28M rows when running the join on the 1.6M subset. The 400K lines are no different from the rest of the 1.6M; I don't see any difference between them. – Stella Jun 17 '21 at 13:23
  • While I recognize it won't error the same way with smaller data, is it possible to provide representative sample data to verify what the join mechanics are doing? – r2evans Jun 18 '21 at 02:44
  • Of course, I'll edit my question and add a sample of the data. – Stella Jun 18 '21 at 08:22
  • Even if this sample does not generate that error, it also does not meet your expectation, returning 11 rows instead of your 4. Based on your output, I'm inferring that adding `[product != product_t,]` after the join would help (see the sketch below these comments), but that still produces pairs you don't have, namely `A.18`-`A.15` and `A.35`-`A.15`, both of which meet the other constraints. Can you expand on your rules as to why those two rows should be excluded? (BTW, removing rows where `product == product_t` is not going to ameliorate your error, since that's a post-join reduction; I'm not sure how best to accommodate that.) – r2evans Jun 18 '21 at 16:46
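The post-join reduction r2evans describes would look roughly like this (a sketch only: the filter runs after the join has materialised, so it does not prevent the vecseq error on the full data):

```r
# Drop pairs of a product with itself; this is a post-join reduction,
# so the full join result must still fit in memory first
db_join <- db[dt,
              .(product, product_t, position_min_t, position_max_t, count_pos_t),
              on = .(position_min <= position_min_t, position_max >= position_max_t),
              allow.cartesian = TRUE
              ][product != product_t]
```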

0 Answers