
I am trying to run two Spark actions as below, and I expect them to run in parallel since they use different pools. Does scheduling with pools mean that independent actions will run in parallel? That is, if I have 200 cores, pool1 uses 100 cores and pool2 uses 100 cores while processing the actions. In my case, the second dataframe action starts only after the first dataframe action in pool1 has completed.

spark.sparkContext.setLocalProperty("spark.scheduler.pool", "pool1")
dataframe.show(100, false)

spark.sparkContext.setLocalProperty("spark.scheduler.pool", "pool2")
dataframe2.show(100, false)

My pool configuration xml

<?xml version="1.0"?>

<allocations>
  <pool name="pool1">
    <schedulingMode>FAIR</schedulingMode>
    <weight>1</weight>
  </pool>
  <pool name="pool2">
    <schedulingMode>FAIR</schedulingMode>
    <weight>1</weight>
  </pool>
</allocations>
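One detail worth noting: the pool property set via `setLocalProperty` is per thread, so two actions issued one after the other from the same thread still run sequentially regardless of pool assignment. Below is a hedged sketch of submitting each action from its own thread, assuming `spark`, `dataframe`, and `dataframe2` are defined as in the snippet above:

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.duration.Duration
import scala.concurrent.ExecutionContext.Implicits.global

// Each Future body runs on its own thread, so each thread can
// carry its own thread-local scheduler pool.
val f1 = Future {
  spark.sparkContext.setLocalProperty("spark.scheduler.pool", "pool1")
  dataframe.show(100, false)
}
val f2 = Future {
  spark.sparkContext.setLocalProperty("spark.scheduler.pool", "pool2")
  dataframe2.show(100, false)
}

// Block until both actions finish.
Await.result(f1, Duration.Inf)
Await.result(f2, Duration.Inf)
```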
  • Have you set the conf property spark.scheduler.allocation.file to the pool configuration XML? conf.set("spark.scheduler.allocation.file", "/path/to/file") – Anurag Sharma Mar 22 '19 at 09:20
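For reference, both the scheduler mode and the allocation file from the comment above can be set when building the session. A minimal sketch (the app name and file path are placeholders):

```scala
import org.apache.spark.sql.SparkSession

// spark.scheduler.mode and spark.scheduler.allocation.file are
// standard Spark settings; the path here is illustrative.
val spark = SparkSession.builder()
  .appName("fair-pools-demo")
  .config("spark.scheduler.mode", "FAIR")
  .config("spark.scheduler.allocation.file", "/path/to/file")
  .getOrCreate()
```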

1 Answer


As per the given details, your jobs should run in parallel based on the Spark configuration, but there are a few parameters that need to be considered:

  1. Is YARN your cluster manager? If it is, have you configured the pools in YARN's configuration?

  2. I can see you are using the FAIR scheduler, which means the default scheduler is being overridden; have you configured the same in YARN?

To configure the FAIR scheduler, please go through the link below; everything is given in detail: http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/FairScheduler.html
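For reference, a YARN fair-scheduler allocation file (the format described at the link above) looks roughly like the sketch below; the queue names mirror the pool names from the question and are illustrative:

```xml
<?xml version="1.0"?>
<allocations>
  <queue name="pool1">
    <weight>1.0</weight>
    <schedulingPolicy>fair</schedulingPolicy>
  </queue>
  <queue name="pool2">
    <weight>1.0</weight>
    <schedulingPolicy>fair</schedulingPolicy>
  </queue>
</allocations>
```

Note that this YARN-level file is separate from the Spark-level pool file shown in the question: YARN queues divide cluster resources between applications, while Spark scheduler pools divide an application's resources between its own jobs.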