I am trying to understand the job, stage and task details shown in the Spark UI for the following PySpark snippet:

```python
import pandas as pd
from pyspark.sql import functions as F

# Disabling adaptive execution to see 200 shuffle partitions show up
spark.conf.set('spark.sql.adaptive.enabled', 'false')

df = spark.createDataFrame(pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=['a', 'b', 'c']))
df.groupBy('a').agg({'b': 'sum'}).withColumnRenamed('sum(b)', 'b').orderBy(F.desc('b')).collect()
```
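The 200 in the comment is the default value of `spark.sql.shuffle.partitions`; a quick way to confirm it in the same session, using nothing beyond the standard config API:

```python
# Default shuffle partition count used by both exchanges; 200 unless overridden
print(spark.conf.get('spark.sql.shuffle.partitions'))
```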
This part makes sense, as I see two shuffle exchanges corresponding to the `groupBy` and `orderBy` operations. I am guessing the `HashAggregate` in the first block (before the first shuffle) is a local aggregation done by each task (like a combiner in map-reduce). Is that correct?
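For reference, this is the kind of plan I am looking at; a minimal sketch, assuming the same `spark`, `df` and `F` as above and using the standard `explain()` call (the comments describe the rough shape I expect; exact operator names vary by Spark version):

```python
# Print the physical plan for the same query; with AQE disabled it should show,
# reading bottom-up: a partial HashAggregate (the map-side "combiner"),
# Exchange hashpartitioning(a, 200), a final HashAggregate, then
# Exchange rangepartitioning(b DESC, 200) feeding the final Sort for the orderBy.
(df.groupBy('a')
   .agg({'b': 'sum'})
   .withColumnRenamed('sum(b)', 'b')
   .orderBy(F.desc('b'))
   .explain())
```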
For the job and stage level details I have several questions.
- Why are there two jobs when there is only one action, `collect`? (A small sketch for counting the job IDs of this single action follows this list.)
- Stage-5 & 6 (job-3) appears to be transformation for
group-by
. If that is the case then I can't make sense of Stage 7,8,9 (job-4) as I am seeing two more exchanges there. Am I not understanding it correctly? - I am also expecting output of stage-6 to be fed into stage-8 instead
stage-7
becoming the predecessor of stage-8. Also what is the role ofparallelize->mapPartitions...->
in stage-7? What transformation (in pyspark snippet) does it correspond to? - Is it getting skipped because corresponding data is found in cache?
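Here is the sketch mentioned in the first question: it tags the query with a job group and then asks the status tracker which job IDs that single `collect()` produced. `setJobGroup` and `statusTracker` are standard PySpark APIs; the group name is just an arbitrary label.

```python
# Count the jobs triggered by the single collect() (assumes the `spark`/`df` above)
sc = spark.sparkContext
sc.setJobGroup('single-collect', 'groupBy + orderBy + collect')

(df.groupBy('a')
   .agg({'b': 'sum'})
   .withColumnRenamed('sum(b)', 'b')
   .orderBy(F.desc('b'))
   .collect())

job_ids = sc.statusTracker().getJobIdsForGroup('single-collect')
print(len(job_ids), job_ids)  # more than one job id shows up for this one action
```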
Any help will be really appreciated.