I have a dataset that is very large, 350 million to 1 billion records depending on the batch. On the right side of the join I have a much smaller dataset, usually around 10 million records, not more. I cannot simply broadcast the right side (it sometimes grows beyond 8 GB, which is a hard limit). On top of that, my left side has a power-law distribution on the join key.
I have tried the usual salting trick: exploding the right-side key across a range of salt values and adding a random salt to the left-side key, in order to combat the power-law distribution on the left side.
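For reference, the salting approach I use looks roughly like the sketch below. The function name, `saltBuckets`, `key`, and the DataFrame arguments are placeholders, not the actual job's names:

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions._

    // Sketch of the salting trick: salt the skewed left side randomly,
    // replicate the small right side across all salt values.
    def saltedLeftJoin(left: DataFrame, right: DataFrame, key: String, saltBuckets: Int = 32): DataFrame = {
      // Skewed left side: tag every row with a random salt in [0, saltBuckets)
      val saltedLeft = left.withColumn("salt", (rand() * saltBuckets).cast("int"))

      // Small right side: one copy of each row per salt value, so every
      // salted left key can still find its match
      val saltedRight = right.withColumn("salt", explode(array((0 until saltBuckets).map(lit): _*)))

      saltedLeft.join(saltedRight, Seq(key, "salt"), "left").drop("salt")
    }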
This works as intended, but for the occasional batch I get a container failure with memory exceeding the limit (19.5 GB out of 19 GB). I can only go as far as 17 GB + 2 GB overhead per executor. I tried reducing cores in order to have more memory per thread, but the same problem still occurs. The issue happens 2 or 3 times per 50 or so batches, and the same batch runs correctly when the job is restarted from the point of failure.
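The executor sizing I am constrained to is roughly the following (the memory figures are the ones above; the cores value is just an example of what I reduced it to):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .config("spark.executor.memory", "17g")
      .config("spark.executor.memoryOverhead", "2g") // Spark 2.3 name for the off-heap overhead setting
      .config("spark.executor.cores", "2")           // example: fewer concurrent tasks -> more heap per task
      .getOrCreate()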
The right side of the join is produced by joining small data to medium-sized data via a broadcast join, and the larger side of that join is checkpointed in order to save time if errors occur.
val x = larger.join(broadcast(smaller), Seq("col1", "col2", ...), "left")
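The checkpoint mentioned above looks roughly like this (the directory and the `largerRaw` name are placeholders):

    // Dataset.checkpoint() is eager by default, so the larger input is materialized
    // once and can be reused without recomputation after a failure
    spark.sparkContext.setCheckpointDir("hdfs:///tmp/checkpoints") // hypothetical path
    val larger = largerRaw.checkpoint()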
The result is obtained by joining the very large data to x.
val res = very_large.join(x, Seq("col2", "col3", ...), "left_outer").where(condition)
My question is whether re-enabling shuffle hash join (disabled by default) would be a better option in this case. My understanding is that, given my right side is so much smaller than the left side, a shuffle hash join could be a better option than the sort-merge join that is preferred by default.
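In case it matters, the switch I am considering is roughly the following; Spark 2.3 has no SHUFFLE_HASH join hint, so as far as I know the only lever is the planner preference plus the size conditions it checks. The threshold and partition values below are illustrative, not my actual settings:

    spark.conf.set("spark.sql.join.preferSortMergeJoin", "false")
    // My understanding is the hashed (right) side is only built per partition if the planner
    // estimates size(x) < spark.sql.autoBroadcastJoinThreshold * spark.sql.shuffle.partitions
    // and the right side is several times smaller than the left, so these settings gate
    // whether the shuffled hash join is actually chosen at all.
    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", (64L * 1024 * 1024).toString)
    spark.conf.set("spark.sql.shuffle.partitions", "2000")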
I use Spark 2.3 (can't upgrade due to platform constraints). I do have some custom Catalyst expressions, but they have been tested and they don't crash in other jobs. I am listing this only for the sake of completeness.
Note: I cannot paste code samples due to IP.