I'm provisioning a Google Cloud Dataproc cluster in the following way:
gcloud dataproc clusters create spark --async --image-version 1.2 \
--master-machine-type n1-standard-1 --master-boot-disk-size 10 \
--worker-machine-type n1-highmem-8 --num-workers 4 --worker-boot-disk-size 10 \
--num-worker-local-ssds 1
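For reference, the per-node memory and vCore capacity that YARN actually advertises can be inspected from the master node roughly like this (the node ID is a placeholder; use one printed by the first command):
yarn node -list
yarn node -status <node-id>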
Launching a Spark application in yarn-cluster mode with
spark.driver.cores=1
spark.driver.memory=1g
spark.executor.instances=4
spark.executor.cores=8
spark.executor.memory=36g
will only ever launch 3 executor instances instead of the requested 4, effectively "wasting" a full worker node that appears to run only the driver. Also, reducing spark.executor.cores to 7 in order to "reserve" a core on a worker node for the driver does not seem to help.
What configuration is required to be able to run the driver in yarn-cluster mode alongside executor processes, making optimal use of the available resources?