
I am getting two types of errors while running a job on Google Dataproc, and they are causing executors to be lost one by one until the last executor is lost and the job fails. My master node is an n1-highmem-2 (2 vCPUs, 13 GB memory) and my two worker nodes are n1-highmem-8 (8 vCPUs, 52 GB memory). The two errors I get are:

  • "Container exited from explicit termination request."
  • "Lost executor x: Executor heartbeat timed out"

From what I could see online, I need to increase spark.executor.memoryOverhead. I don't know whether this is the right answer, but I can't see how to change it in the Google Dataproc console, and I don't know what to change it to. Any help would be great!

Thanks, jim

jmuth

2 Answers


You can set Spark properties at the cluster level with

gcloud dataproc clusters create ... --properties spark:<name>=<value>,...

and/or at the job level with

gcloud dataproc jobs submit spark ... --properties <name>=<value>,...

The former requires the spark: prefix; the latter doesn't. If both are set, the job-level value takes precedence. See more details in this doc.
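For example, setting spark.executor.memoryOverhead both ways (the cluster name, jar, and main class below are placeholders, and 600 MiB is just an illustrative value, not a recommendation):

```shell
# Cluster level: each Spark property needs the "spark:" prefix
gcloud dataproc clusters create my-cluster \
    --properties spark:spark.executor.memoryOverhead=600

# Job level: no prefix; this value overrides the cluster-level setting
gcloud dataproc jobs submit spark \
    --cluster my-cluster \
    --class org.example.MyJob \
    --jars my-job.jar \
    --properties spark.executor.memoryOverhead=600
```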

Dagang
  • Thank you. This works, but I changed the value to 600 as I saw in another StackOverflow response, and it doesn't help my original problem of the executors being terminated one by one until the job fails. – jmuth May 03 '22 at 15:53
  • Did you update other Spark property values? – Dagang May 03 '22 at 16:55
  • The only other property I modified was in the job, spark.submit.deployMode was set to client. – jmuth May 04 '22 at 20:52
  • Are you using autoscaling or preemptible VM? – Dagang May 06 '22 at 04:32
  • No, I have not defined an autoscaling policy (I've set the policy to None on the Set up cluster page). It is my understanding that secondary workers are preemptible, so yes, my secondary workers are preemptible VMs. – jmuth May 09 '22 at 17:05

It turns out the memory per vCPU was the limitation causing the executors to fail one by one. Initially, I was trying to use the custom machine configuration in the console to add additional memory per vCPU to the cluster. It turns out the UI has a bug (per the Google Dataproc team) that prevents you from increasing the memory per vCPU: if you use the slider to push the memory beyond the default maximum of 6.5 GB per vCPU, cluster setup will fail. However, if you use the command-line equivalent of the console, cluster setup succeeds, and the increased memory per vCPU was enough to complete the job without the executors failing one by one.
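A sketch of the command-line workaround, assuming an N1 custom machine type for the workers (the cluster name and sizes are illustrative; custom-8-65536-ext means 8 vCPUs and 64 GB, where the -ext suffix enables extended memory beyond the 6.5 GB/vCPU default that the console slider enforces):

```shell
# Create the cluster from the CLI instead of the console UI,
# using an extended-memory custom machine type for the workers
gcloud dataproc clusters create my-cluster \
    --master-machine-type n1-highmem-2 \
    --worker-machine-type custom-8-65536-ext \
    --num-workers 2
```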

jmuth