I am running a job that combines Wikidata and Wikipedia pageviews on a small Google Dataproc cluster of two to three nodes. My problem is that most of the time one node is completely idle, even though I tried to increase parallelism by splitting the data into many partitions before starting the job. I also repartition the data according to Spark's parallelism setting (a sketch of this step is below), but no matter what I try, only one node is ever used.
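A stripped-down sketch of the repartitioning step (the bucket paths and the join keys are simplified placeholders, not my actual code):

import org.apache.spark.sql.SparkSession

object Main {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("wikidata-pageviews").getOrCreate()
    val sc = spark.sparkContext

    // Target partition count, taken from spark.default.parallelism
    // (set to 1000 in the submit command below)
    val numPartitions = sc.defaultParallelism

    // Placeholder inputs; the real job reads the Wikidata and pageview dumps
    val wikidata  = sc.textFile("gs://my-bucket/wikidata/*").repartition(numPartitions)
    val pageviews = sc.textFile("gs://my-bucket/pageviews/*").repartition(numPartitions)

    // Key both sides by article title and combine them
    val combined = wikidata.map(l => (l.split('\t')(0), l))
      .join(pageviews.map(l => (l.split(' ')(0), l)), numPartitions)

    combined.saveAsTextFile("gs://my-bucket/output")
  }
}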
My last effort was the script below, which did not help much: it improved the performance of the busy node, but the idle node stayed idle.
Here is the full script I use to create the cluster and run the job:
gcloud dataproc clusters create mycluster \
--zone europe-west1-b \
--master-machine-type n1-standard-8 \
--master-boot-disk-size 500 \
--num-workers 2 \
--worker-machine-type n1-standard-16 \
--worker-boot-disk-size 500 \
--scopes 'https://www.googleapis.com/auth/cloud-platform' \
--project myproject
gcloud dataproc jobs submit spark --cluster mycluster \
--class Main \
--properties \
spark.driver.memory=38g,\
spark.driver.maxResultSize=1g,\
spark.executor.memory=45g,\
spark.driver.cores=4,\
spark.executor.cores=16,\
spark.dynamicAllocation.enabled=true,\
spark.shuffle.service.enabled=true,\
spark.dynamicAllocation.minExecutors=32,\
spark.executor.heartbeatInterval=36000s,\
spark.network.timeout=86000s,\
spark.default.parallelism=1000,\
spark.driver.extraJavaOptions=-XX:+UseConcMarkSweepGC,\
spark.executor.extraJavaOptions=-XX:+UseConcMarkSweepGC \
--files /path/to/file/properties.properties \
--jars myjar.jar \
-- customArg1=value1 flagA=false flagB=true
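One quick check I can add at the top of the job (in the main method from the sketch above) to see which hosts actually get executors; getExecutorMemoryStatus returns a map keyed by "host:port" and includes the driver, so with both workers active I would expect three distinct hosts:

// Diagnostic: log where executors are running; one entry is the driver itself
sc.getExecutorMemoryStatus.keys.foreach(host => println(s"executor at: $host"))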