I have DF1 with ~50k records. DF2 has over 5 billion records, read from Parquet files on S3. I need to do a left outer join on an md5 hash present in both DataFrames, but as expected it's slow and expensive.

I tried a broadcast join, but DF1 is quite big as well.

I was wondering what the best way to handle this would be. Should I filter DF2 down to those 50k records (md5s) first and then do the join with DF1?
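
Something like this is what I had in mind (a rough PySpark sketch; the md5 column name and the S3 paths are placeholders for my actual data):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

df1 = spark.read.parquet("s3://bucket/df1/")   # ~50k rows (placeholder path)
df2 = spark.read.parquet("s3://bucket/df2/")   # >5B rows  (placeholder path)

# Pre-filter DF2 down to only the md5 values that appear in DF1
# (the 50k keys are small enough to broadcast), then do the left join
# against the much smaller filtered result.
keys = df1.select("md5").distinct()
df2_filtered = df2.join(F.broadcast(keys), on="md5", how="left_semi")

result = df1.join(df2_filtered, on="md5", how="left")
```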

Thanks.

  • You can try to filter first. Anyway, the smaller dataset on the left should benefit from the left-join optimization against the large table, but since DF2 is dramatically bigger, your suggestion may help here. – Benny Elgazar Mar 02 '22 at 09:16
  • What is slow for you? How much time does it take? – Steven Mar 02 '22 at 09:16
  • Also, as Steven mentioned, "slow" is subjective, and it also depends on the size of the cluster. I'm assuming the 5B rows can't fit in the memory cache; if they can, that could improve your join performance. But 5B is pretty large, and I'm assuming your cluster is not that big. Can you give us some specifications? – Benny Elgazar Mar 02 '22 at 09:19
  • It seems to me that 50k is not so big for broadcasting. Do you get any errors? – Emiliano Martinez Mar 02 '22 at 09:19
  • @Steven: It's not slow per se. With 80 nodes (32 GB memory, 8 vCPUs each) it takes around 30 minutes. But since I need to run it every hour, 24 hours a day, 365 days a year, it's very expensive. – OneWorld Mar 02 '22 at 09:33
  • @EmilianoMartinez: I am using serverless Spark with AWS Glue, 80 G.2X nodes. Five minutes into the broadcast join I get the errors below: 1. "Job 0 cancelled because SparkContext was shut down", caused by the threshold for executors failed after launch being reached; or 2. "BrokenPipeError: [Errno 32] Broken pipe". I thought it was related to memory, so I set --conf spark.driver.maxResultSize=0 --conf spark.sql.broadcastTimeout=1800 --conf spark.driver.memoryOverhead=16g --conf spark.executor.memoryOverhead=16g --conf spark.driver.memory=8g --conf spark.executor.memory=8g. No luck with this. – OneWorld Mar 02 '22 at 09:35
  • @BennyElgazar: So the cluster is 160 DPU, i.e. 80 G.2X workers (1 executor per worker) with 8 vCPUs / 32 GB each. It currently takes around 30 minutes. – OneWorld Mar 02 '22 at 09:41
  • Can you see the executor logs, to check whether there is an out-of-memory error? How many columns does the DataFrame have? If it is very wide, maybe it cannot be persisted on every executor with that much memory. – Emiliano Martinez Mar 02 '22 at 10:29

1 Answer

You could try the following options (a rough sketch combining them follows the list):

  1. Remove duplicate rows and filter out any columns that are not relevant for downstream processing before calling the outer join

  2. Persist DF1 to disk before calling the join operation

  3. Figure out whether DF2 has data skew on the join column; if so, split out the skewed key values and process their join separately

  4. You could also try tweaking the execution parameters to tune the job
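
To make the options above more concrete, here is a rough PySpark sketch combining them. The column names (md5, needed_col, other_col), the S3 paths, the hot-key cut-off, and the config values are all placeholders/assumptions, and the manual hot-key split for option 3 is just one way to handle skew:

```python
from pyspark import StorageLevel
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Option 4: a few illustrative tuning knobs -- the values are examples, not recommendations.
spark = (
    SparkSession.builder
    .config("spark.sql.shuffle.partitions", "2000")
    .config("spark.sql.adaptive.enabled", "true")            # adaptive query execution (AQE)
    .config("spark.sql.adaptive.skewJoin.enabled", "true")   # let AQE split skewed partitions
    .getOrCreate()
)

df1 = spark.read.parquet("s3://bucket/df1/")   # ~50k rows (placeholder path)
df2 = spark.read.parquet("s3://bucket/df2/")   # >5B rows  (placeholder path)

# Option 1: drop duplicates and keep only the columns needed downstream.
# "needed_col" / "other_col" stand in for whatever columns you actually use.
df1_slim = df1.select("md5", "needed_col").dropDuplicates(["md5"])
df2_slim = df2.select("md5", "other_col")

# Option 2: persist the small side to disk so it is not recomputed across stages.
df1_slim.persist(StorageLevel.DISK_ONLY)
df1_slim.count()   # force materialization of the persisted DataFrame

# Option 3: check for skew on the join key and, if a few md5 values dominate DF2,
# join those "hot" keys separately from the rest.
hot_keys = (
    df2_slim.groupBy("md5").count()
    .orderBy(F.desc("count"))
    .limit(10)                     # the cut-off for "hot" is arbitrary here
    .select("md5")
)

df1_hot  = df1_slim.join(F.broadcast(hot_keys), "md5", "left_semi")
df1_rest = df1_slim.join(F.broadcast(hot_keys), "md5", "left_anti")
df2_hot  = df2_slim.join(F.broadcast(hot_keys), "md5", "left_semi")
df2_rest = df2_slim.join(F.broadcast(hot_keys), "md5", "left_anti")

# The two partial left joins cover disjoint key sets, so their union
# is equivalent to the single big left join.
result = (
    df1_hot.join(df2_hot, "md5", "left")
    .unionByName(df1_rest.join(df2_rest, "md5", "left"))
)
```

Whether the manual split in option 3 pays off depends on how skewed the key really is; on Spark 3.x it is often simpler to lean on the adaptive skew-join settings shown at the top of the sketch.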