I'm using the Cloudera quickstart VM (CDH 5.10.1) with PySpark (1.6.0) and YARN (MR2 included) to aggregate numerical data per hour. I've got 1 CPU with 4 cores and 32 GB of RAM.
I've got a file named aggregate.py, but until today I had never submitted the job with spark-submit; I used the pyspark interactive shell and copy/pasted the code to test it.
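In case the shape of the job matters, aggregate.py boils down to something like the sketch below (simplified; the input path and column names are placeholders, not my real ones):

from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql import functions as F

# The master and resources are left to spark-submit / the shell;
# nothing is hardcoded here. (In the shell I skip this line, since
# sc already exists.)
sc = SparkContext()
sqlContext = SQLContext(sc)

# Read the raw events, bucket them by hour, sum a numeric column.
df = sqlContext.read.json("/data/events")
hourly = (df.withColumn("hour", F.hour(df["timestamp"]))
            .groupBy("hour")
            .agg(F.sum("value").alias("total")))
hourly.write.mode("overwrite").parquet("/data/hourly")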
When starting the pyspark interactive shell I used:
pyspark --master yarn-client
I monitored the job in the web UI at quickstart.cloudera:8088/cluster and could see that YARN created 3 executors and 1 driver with one core each (not a good configuration, but the main purpose is a proof of concept until we move to a real cluster).
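For what it's worth, the same count can be checked from inside the shell with this snippet (it goes through the internal _jsc handle, so it's a quick check rather than a public API):

# Executors that registered with the driver; the driver itself
# counts as one entry, so this matches the 4 I see in the UI
# (3 executors + 1 driver).
print(sc._jsc.sc().getExecutorMemoryStatus().size())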
When submitting the same code with spark-submit:
spark-submit --verbose \
--master yarn \
--deploy-mode client \
--num-executors 2 \
--driver-memory 3G \
--executor-memory 6G \
--executor-cores 2 \
aggregate.py
I only get the driver, which also executes the tasks. Note that spark.dynamicAllocation.enabled is set to true in the Environment tab, and spark.dynamicAllocation.minExecutors is set to 2.
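As far as I understand, passing --num-executors should disable dynamic allocation by itself, but for completeness this is the fully explicit variant I would use to rule it out (same command, with the allocation setting pinned on the command line):

spark-submit --verbose \
--master yarn \
--deploy-mode client \
--conf spark.dynamicAllocation.enabled=false \
--num-executors 2 \
--driver-memory 3G \
--executor-memory 6G \
--executor-cores 2 \
aggregate.py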
I tried using spark-submit aggregate.py alone, and I still got only the driver as executor. I can't manage to get more than 1 executor with spark-submit, yet it works in the pyspark interactive shell!
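If it helps, here is the debug block I can drop at the top of aggregate.py to compare what the shell and spark-submit actually hand the application (sc._conf is an internal handle, so this is just a quick-and-dirty check):

# Print the effective master and every executor / allocation setting.
print(sc.master)
for k, v in sorted(sc._conf.getAll()):
    if "executor" in k or "dynamicAllocation" in k:
        print(k, "=", v)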
My YARN configuration is as follows:
yarn.nodemanager.resource.memory-mb = 17 GiB
yarn.nodemanager.resource.cpu-vcores = 4
yarn.scheduler.minimum-allocation-mb = 3 GiB
yarn.scheduler.maximum-allocation-mb = 16 GiB
yarn.scheduler.minimum-allocation-vcores = 1
yarn.scheduler.maximum-allocation-vcores = 2
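If my arithmetic is right, the requested containers should fit. Assuming Spark 1.6's default spark.yarn.executor.memoryOverhead of max(384 MB, 10% of executor memory):

# Rough container sizing for the spark-submit command above.
executor_mb = 6 * 1024                           # --executor-memory 6G
overhead_mb = max(384, int(executor_mb * 0.10))  # 614 MB by default
per_executor = executor_mb + overhead_mb         # ~6.6 GiB per container
print(per_executor, 2 * per_executor)            # two executors: ~13.2 GiB < 17 GiB

So two 6 GB executors plus the application master should stay under the 17 GiB NodeManager limit, and --executor-cores 2 matches the 2-vcore maximum, which is why the behaviour confuses me.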
If someone can explain what I'm doing wrong, it would be a great help!