As per my understanding, there should be one job for each action in Spark, but I often see more than one job triggered for a single action. I tried to test this by doing a simple aggregation on a dataset to get the distinct values of the dep column in my table. I am using Spark 3+ to test this.
Following are my queries:
sqlContext.sql("CREATE TABLE emp (id INT, dep STRING)")
sqlContext.sql("INSERT INTO emp VALUES (1, 'hr')")
sqlContext.sql("INSERT INTO emp VALUES (2, 'eng')")
sqlContext.sql("INSERT INTO emp VALUES (6, 'facility')")
Explain output for the following query:
sqlContext.sql("select dep from emp GROUP by dep").explain
== Physical Plan ==
*(2) HashAggregate(keys=[dep#31], functions=[])
+- Exchange hashpartitioning(dep#31, 200), ENSURE_REQUIREMENTS, [plan_id=155]
   +- *(1) HashAggregate(keys=[dep#31], functions=[])
      +- Scan hive default.emp [dep#31], HiveTableRelation [`default`.`emp`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [empid#30, dep#31], Partition Cols: []]
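For reference, I believe the equivalent DataFrame query should give the same plan shape (partial HashAggregate, then an Exchange on dep, then the final HashAggregate). A minimal sketch, assuming spark is the active SparkSession:

// Same "distinct values of dep" query through the DataFrame API
spark.table("emp").select("dep").distinct().explain()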
I have also set the following settings in the Spark job configuration:
spark.conf.set("spark.sql.adaptive.enabled",false)
spark.conf.set("spark.sql.shuffle.partitions",200)
I was trying to understand the number of tasks created by each job, starting from the bottom.
For Job ID 13 there are 8 tasks shown on the UI. I checked and there are 7 files on HDFS for this table (since I inserted 7 records in total), so I can understand 7 of those tasks, one created to read each file.
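For reference, the file count can be checked with something like the sketch below. This is an assumption on my part: it looks up the table location via DESCRIBE FORMATTED and lists it with the Hadoop FileSystem API, assuming spark is the active SparkSession and the location is reachable from the driver:

import org.apache.hadoop.fs.{FileSystem, Path}

// Look up the table's storage location from the DESCRIBE FORMATTED output
val location = spark.sql("DESCRIBE FORMATTED emp")
  .where("col_name = 'Location'")
  .head()
  .getString(1)

// List the data files under that location
val path = new Path(location)
val fs = path.getFileSystem(spark.sparkContext.hadoopConfiguration)
val dataFiles = fs.listStatus(path).filter(_.isFile)
println(s"${dataFiles.length} files under $location")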
For the other jobs with IDs (14, 15, 16, 17) I see (4, 20, 100, 75) tasks respectively. I don't understand how Spark arrives at these task counts for each job. As per my understanding, there should be 200 tasks in total for these jobs, driven by the setting "spark.sql.shuffle.partitions".
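To check that expectation directly, the partition count of the grouped result can be inspected. A minimal sketch, assuming spark is the active SparkSession:

val grouped = spark.sql("select dep from emp GROUP by dep")
// With AQE disabled, the post-shuffle side should have spark.sql.shuffle.partitions partitions
println(grouped.rdd.getNumPartitions)   // expecting 200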
On examining the Spark UI, I can see there are 5 "jobs" executed for the groupBy operation, while I was expecting just 3 (1 for reading, 1 for aggregation, 1 for show).
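To count how many jobs a single action really fires, I assume a SparkListener could be attached before calling the action. A rough sketch, assuming spark is the active SparkSession (the Thread.sleep is only there because listener events are delivered asynchronously):

import java.util.concurrent.atomic.AtomicInteger
import org.apache.spark.scheduler.{SparkListener, SparkListenerJobStart}

val jobCount = new AtomicInteger(0)
val listener = new SparkListener {
  // Increment the counter every time a new job starts
  override def onJobStart(jobStart: SparkListenerJobStart): Unit = {
    jobCount.incrementAndGet()
  }
}
spark.sparkContext.addSparkListener(listener)

spark.sql("select dep from emp GROUP by dep").show()   // the single action

Thread.sleep(2000)   // wait for the listener bus to deliver the events
println(s"Jobs triggered by show(): ${jobCount.get()}")
spark.sparkContext.removeSparkListener(listener)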
Can anyone help me by answering the above questions?