I ran 3 tests on Spark 3.3.0:
X. With spark.sql.autoBroadcastJoinThreshold set to 2GB and AQE disabled, runtime = 30 minutes.
Y. With spark.sql.autoBroadcastJoinThreshold=-1 (disabled) and AQE disabled, runtime = 5.5 hours.
Z. With spark.sql.autoBroadcastJoinThreshold=-1 and AQE enabled with skew join optimization, runtime = 1 hour.
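For reference, the three runs could be written down as explicit settings, roughly as in the sketch below. The config keys are standard Spark SQL options; the names `RUN_CONFIGS` and `apply_run_config` are hypothetical, and "2GB" is assumed to mean 2 GiB expressed in bytes. AQE is on by default since Spark 3.2, so runs X and Y have to turn it off explicitly.

```python
# Hypothetical helper for reproducing runs X, Y and Z.
# Assumption: "2GB" in run X means 2 GiB, given in bytes.
GIB = 1024 ** 3

RUN_CONFIGS = {
    "X": {
        "spark.sql.autoBroadcastJoinThreshold": str(2 * GIB),
        "spark.sql.adaptive.enabled": "false",
    },
    "Y": {
        "spark.sql.autoBroadcastJoinThreshold": "-1",  # -1 disables auto-broadcast
        "spark.sql.adaptive.enabled": "false",
    },
    "Z": {
        "spark.sql.autoBroadcastJoinThreshold": "-1",
        "spark.sql.adaptive.enabled": "true",
        "spark.sql.adaptive.skewJoin.enabled": "true",
    },
}


def apply_run_config(spark, run):
    """Set the SQL options for one run on an existing SparkSession."""
    settings = RUN_CONFIGS[run]
    for key, value in settings.items():
        spark.conf.set(key, value)
    return settings
```

With a live session this would be called as `apply_run_config(spark, "Z")` before running the job. Writing the runs out this way shows that Y and Z differ in every AQE feature, not only in skew join handling.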
I ran these tests to measure the benefit of skew join optimization. My aim was to force skewed joins by disabling auto-broadcast, since broadcast joins naturally handle data skew. Some questions:
- Is Y being slower than X a definite indicator that the data was skewed, since broadcast joins are supposed to handle data skew, or is there little/no correlation between the two?
- Does Z being faster than Y show that skew join optimization improved runtime by about 5.5x (5.5 hours down to 1 hour)? I would assume this only holds if the answer to the first question is yes. Otherwise, could the improvement have come from some other aspect of AQE?
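One way to check the skew assumption directly, rather than inferring it from runtimes, is to look at shuffle partition sizes for the join stage (for example, read off the Spark UI for run Y). The sketch below approximates the rule AQE's skew join handling is documented to use: a partition counts as skewed if it is larger than `spark.sql.adaptive.skewJoin.skewedPartitionFactor` times the median partition size and also larger than `spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes`. The function name is hypothetical, the defaults of 5 and 256MB are the documented ones as far as I know, and Spark's own median calculation may differ slightly from `statistics.median`.

```python
from statistics import median

MIB = 1024 * 1024


def skewed_partitions(sizes_bytes, factor=5.0, threshold_bytes=256 * MIB):
    """Return the indices of partitions that look skewed.

    Approximates AQE's rule: size > factor * median AND size > threshold.
    This is an illustration, not Spark's actual implementation.
    """
    med = median(sizes_bytes)
    return [
        i
        for i, size in enumerate(sizes_bytes)
        if size > factor * med and size > threshold_bytes
    ]


# Made-up sizes for illustration: nine 100 MiB partitions and one 4 GiB partition.
example_sizes = [100 * MIB] * 9 + [4096 * MIB]
flagged = skewed_partitions(example_sizes)
```

If no partition in the real job meets both conditions, skew join optimization would not be expected to kick in, and the speedup in Z would more likely come from other AQE features such as partition coalescing.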