I am using Spark 3.2.1 to summarise high-volume data using joins. Spark's plan shows that one executor was tasked with processing 90 GB of data after the AQEShuffleRead step, as shown in the plan below. Also, the shuffle partition count of 900 was drastically brought down to 8.
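For context, the session is configured roughly like this (a sketch in Scala; the app name is a placeholder, and AQE is enabled by default in Spark 3.2.x):

    import org.apache.spark.sql.SparkSession

    // Sketch of the session setup; the shuffle-partition setting is the
    // 900 mentioned above, everything else is left at its defaults.
    val spark = SparkSession.builder()
      .appName("payments-summary")                    // placeholder name
      .config("spark.sql.shuffle.partitions", "900")  // 900 shuffle partitions
      .config("spark.sql.adaptive.enabled", "true")   // already the default in 3.2.1
      .getOrCreate()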
I tried setting spark.sql.adaptive.coalescePartitions.enabled = false; the sketch below shows how I set it, followed by the plan I got.
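This is how I turn the coalescing off on the existing session (a sketch; the surrounding job code is omitted):

    // Turn off only AQE partition coalescing; other AQE features stay enabled.
    spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "false")

    // Related AQE coalescing knobs and their Spark 3.2.1 defaults, for reference:
    //   spark.sql.adaptive.advisoryPartitionSizeInBytes        = 64MB
    //   spark.sql.adaptive.coalescePartitions.parallelismFirst = true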
I thought the row_number() shown below was causing the issue, but removing it still produced a similar plan with AQE:
    row_number() over (
        partition by payments_payment_reference, payments_payment_id,
                     payments_payment_refund_reference, sale_or_refund
        order by payments_payment_id, payments_payment_refund_reference, create_date_ymd
    ) payments_fee_filter
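For reference, the same window expressed with the DataFrame API looks roughly like this (a sketch; summaryDf is just a placeholder name for the joined data):

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.row_number

    // Same partition/order keys as the SQL expression above.
    val feeFilterWindow = Window
      .partitionBy("payments_payment_reference", "payments_payment_id",
                   "payments_payment_refund_reference", "sale_or_refund")
      .orderBy("payments_payment_id", "payments_payment_refund_reference", "create_date_ymd")

    // summaryDf is a placeholder for the joined DataFrame.
    val withFeeFilter = summaryDf
      .withColumn("payments_fee_filter", row_number().over(feeFilterWindow))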
The job writes a summary, so the total output is only about 180 MB of Parquet. Because of that, I do a repartition in the final step so that I am left with a single output file, roughly as sketched below.
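The final write step looks something like this (a sketch; the output path is a placeholder, I am assuming repartition(1) here, and withFeeFilter is the DataFrame from the sketch above):

    // Collapse to one task so exactly one parquet file (~180 MB) is written.
    withFeeFilter
      .repartition(1)
      .write
      .mode("overwrite")
      .parquet("/output/payments_summary")   // placeholder path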
Why is Spark behaving like this? How can I overcome this and distribute the load more evenly across executors?