
I ran 3 tests on Spark 3.3.0:

X. If spark.sql.autoBroadcastJoinThreshold is set to 2GB and AQE is disabled, runtime = 30 minutes.

Y. If spark.sql.autoBroadcastJoinThreshold=-1 (disabled) and AQE is disabled, runtime = 5.5 hours.

Z. If spark.sql.autoBroadcastJoinThreshold=-1 (disabled) and AQE is enabled with skew join optimization, runtime = 1 hour.
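
For reference, the three runs can be written down as Spark SQL settings. This is only a sketch: the names `RUNS` and `submit_flags` are made up for illustration, 2GB is assumed to mean 2 * 1024^3 bytes, and the explicit `spark.sql.adaptive.skewJoin.enabled` entry in Z is an assumption about how skew join optimization was switched on.

```python
# Hypothetical helper: maps each test label (X, Y, Z) to the settings it implies.
TWO_GB = 2 * 1024**3  # assumed meaning of "2GB", expressed in bytes

RUNS = {
    "X": {
        "spark.sql.autoBroadcastJoinThreshold": str(TWO_GB),
        "spark.sql.adaptive.enabled": "false",
    },
    "Y": {
        "spark.sql.autoBroadcastJoinThreshold": "-1",  # -1 disables auto-broadcast
        "spark.sql.adaptive.enabled": "false",
    },
    "Z": {
        "spark.sql.autoBroadcastJoinThreshold": "-1",
        "spark.sql.adaptive.enabled": "true",
        "spark.sql.adaptive.skewJoin.enabled": "true",
    },
}


def submit_flags(run):
    """Render one run's settings as a list of spark-submit --conf arguments."""
    flags = []
    for key, value in RUNS[run].items():
        flags += ["--conf", f"{key}={value}"]
    return flags


if __name__ == "__main__":
    for name in RUNS:
        print(name, " ".join(submit_flags(name)))
```

Keeping the runs side by side like this makes it easier to see that Y and Z differ only in the AQE settings, while X and Y differ only in the broadcast threshold.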

I ran these tests to measure the benefit of skew join optimization. My aim was to force skewed joins by disabling auto-broadcast, since broadcast joins naturally handle data skew. Some questions:

  1. Is Y being slower than X a 100% indicator that data was skewed since broadcast joins are supposed to handle data skew, or is there little/no correlation between the two?
  2. Does Z being faster than Y show that skew join optimization improved the runtime by almost 6x? I would assume this is only true if 1 is true. Otherwise, could the improvement have come from some other aspect of AQE?
Koedlt

1 Answer

  1. False: broadcast joins are not designed to handle data skew. Their aim is to speed up a join by reducing shuffle: when a DataFrame is broadcast, a full copy of it is placed on each executor, so the join is performed locally on each executor without the need to shuffle.

    X is faster than Y because you have enough memory to broadcast tables of up to 2GB. Setting the threshold to -1 disables broadcast joins, so Spark falls back to its default strategy, SortMergeJoin, which introduces a shuffle. That is why it takes longer.

  2. True: your data does seem to be skewed, which is why the run is faster with AQE. To check in the Spark UI whether your data is skewed, see this post: Apache Spark: How to detect data skew using Spark web UI
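
As a rough cross-check on what the Spark UI shows, you can apply the rule AQE documents for skew joins yourself: a partition counts as skewed when it is larger than `spark.sql.adaptive.skewJoin.skewedPartitionFactor` (default 5) times the median partition size and also larger than `spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes` (default 256MB). The sketch below is plain Python, not Spark's own code; `skewed_partitions` is a made-up name, and the sizes are assumed to be shuffle partition sizes in bytes that you copied from the UI.

```python
from statistics import median

MB = 1024 * 1024


def skewed_partitions(sizes_bytes, factor=5.0, threshold_bytes=256 * MB):
    """Return indices of partitions that exceed both skew limits.

    The defaults mirror the documented defaults of
    spark.sql.adaptive.skewJoin.skewedPartitionFactor and
    spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes.
    This is an approximation of the rule, not Spark's implementation.
    """
    if not sizes_bytes:
        return []
    med = median(sizes_bytes)
    return [
        i
        for i, size in enumerate(sizes_bytes)
        if size > factor * med and size > threshold_bytes
    ]


if __name__ == "__main__":
    # Illustrative numbers only: nine 100MB partitions and one 4000MB partition.
    sizes = [100 * MB] * 9 + [4000 * MB]
    print(skewed_partitions(sizes))
```

If no partition passes both limits, the speedup in Z more likely comes from other AQE features (for example partition coalescing) than from skew join handling.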

Abdennacer Lachiheb