
I am running a job that combines Wikidata and Wikipedia pageviews on a small Google Dataproc cluster of two to three nodes. My problem is that most of the time one node is completely idle, even though I have tried to increase parallelism by splitting the data into many partitions before starting the job. I also repartition the data according to Spark's parallelism setting, but no matter what I try, only one node is ever in use.

My last attempt was the following script, which did not help much: it improved the performance of the working node, but the other node stayed idle.

The script I use to create the cluster and run the job is the following:

 gcloud dataproc clusters create mycluster \
 --zone europe-west1-b \
 --master-machine-type n1-standard-8 \
 --master-boot-disk-size 500 \
 --num-workers 2 \
 --worker-machine-type n1-standard-16 \
 --worker-boot-disk-size 500 \
 --scopes 'https://www.googleapis.com/auth/cloud-platform' \
 --project myproject


gcloud dataproc jobs submit spark --cluster mycluster \
--class Main \
--properties \
spark.driver.memory=38g,\
spark.driver.maxResultSize=1g,\
spark.executor.memory=45g,\
spark.driver.cores=4,\
spark.executor.cores=16,\
spark.dynamicAllocation.enabled=true,\
spark.shuffle.service.enabled=true,\
spark.dynamicAllocation.minExecutors=32,\
spark.executor.heartbeatInterval=36000s,\
spark.network.timeout=86000s,\
spark.default.parallelism=1000,\
spark.driver.extraJavaOptions=-XX:+UseConcMarkSweepGC,\
spark.executor.extraJavaOptions=-XX:+UseConcMarkSweepGC \
--files /path/to/file/properties.properties \
--jars myjar.jar \
-- customArg1=value1 flagA=false flagB=true
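
(For reference, one way to confirm which nodes YARN is actually using is to query the ResourceManager from the cluster's master node; mycluster-m is the name Dataproc gives the master of a cluster called mycluster. These diagnostic commands are not part of the original script.)

 # List all YARN nodes with their used and available resources
 gcloud compute ssh mycluster-m --zone europe-west1-b \
   --command "yarn node -list -all"

 # List running YARN applications
 gcloud compute ssh mycluster-m --zone europe-west1-b \
   --command "yarn application -list"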
orestis

2 Answers


One node remains idle because it is running the YARN ApplicationMaster, which doesn't leave enough room on that node for an executor. (Each executor container needs spark.executor.memory plus its memory overhead, which by default is the larger of 384 MB and 10% of the executor memory, so 45g executors leave very little headroom.)

If you set spark.yarn.am.memory=1g,spark.yarn.am.memoryOverhead=384, the ApplicationMaster becomes small enough to share a node with an executor, and you should see all nodes used.
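
For illustration, here is a trimmed version of the submit command from the question with the two ApplicationMaster properties added; only spark.yarn.am.memory and spark.yarn.am.memoryOverhead are new, the rest are the asker's original values:

 gcloud dataproc jobs submit spark --cluster mycluster \
 --class Main \
 --properties \
 spark.yarn.am.memory=1g,\
 spark.yarn.am.memoryOverhead=384,\
 spark.executor.memory=45g,\
 spark.executor.cores=16,\
 spark.default.parallelism=1000 \
 --jars myjar.jar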

See this documentation for more information.

Patrick Clay

Building on Patrick Clay's answer, first, here is a citation:

"Every container cluster has a single master endpoint, which is managed by Container Engine. The master provides a unified view into the cluster and, through its publicly-accessible endpoint, is the doorway for interacting with the cluster."

I had the same problem (except with gcloud container clusters ...), and to get one pod per node scheduled and running, even on a small cluster where the master node's influence is visible, I had to set the CPU limit low enough for the pod to fit on each node.

Here is my pod.json (some things skipped):

{
  "kind": "Pod",
  "apiVersion": "v1",
  "spec": {
    "containers": [
      {
        "resources": {
          "limits": {
            "cpu": "700m"
          }
        }
      }
    ]
  }
}
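
Assuming the skipped fields (metadata, container name and image) are filled in, the pod can be created and its node placement verified with standard kubectl commands:

 # Create the pod from the spec above
 kubectl create -f pod.json

 # Show which node each pod was scheduled on
 kubectl get pods -o wide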
Severin Pappadeux