
Please bear with me because I am still quite new to Spark.

I have a GCP DataProc cluster which I am using to run a large number of Spark jobs, 5 at a time.

The cluster is 1 master + 16 workers, with 8 cores / 40 GB memory / 1 TB storage per node.

Now I might be misunderstanding something or not doing something correctly, but I currently have 5 jobs running at once, and the Spark UI shows that only 34 of 128 vCores are in use. They also do not appear to be evenly distributed: the jobs were started simultaneously, but the core distribution across them is 2/7/7/11/7, and only one core is allocated per running container.

I have used the flags --executor-cores 4 and --num-executors 6, which don't seem to have made any difference.
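For reference, my submission command looks roughly like this (the main class and jar path are placeholders, not my actual job):

```shell
# Sketch of the submission; --executor-cores and --num-executors are the
# flags mentioned above. Class name and jar location are hypothetical.
spark-submit \
  --executor-cores 4 \
  --num-executors 6 \
  --class com.example.MyJob \
  gs://my-bucket/jars/my-job.jar
```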

Can anyone offer some insight/resources as to how I can fine tune these jobs to use all available resources?

Cam

2 Answers


I have managed to solve the issue: there was no cap on executor memory usage, so each node's memory was being claimed by just 2 cores' worth of executors per node.

I added the property spark.executor.memory=4G and re-ran the job, and it instantly allocated 92 cores.
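If anyone wants the concrete command, this is roughly how the property can be passed when submitting through Dataproc (cluster name, class, and jar path below are placeholders):

```shell
# Capping executor memory lets YARN fit several executors on each 40 GB node
# instead of a few memory-hungry ones, so far more cores get used.
gcloud dataproc jobs submit spark \
  --cluster=my-cluster \
  --class=com.example.MyJob \
  --jars=gs://my-bucket/jars/my-job.jar \
  --properties=spark.executor.memory=4g
```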

Hope this helps someone else!

Cam

The Dataproc default configurations should take care of the number of executors. Dataproc also enables dynamic allocation, so executors will only be allocated if needed (according to Spark).

Spark cannot parallelize beyond the number of partitions in a Dataset/RDD. You may need to set the following properties to get good cluster utilization:

  • spark.default.parallelism: the default number of output partitions from transformations on RDDs (when not explicitly set)
  • spark.sql.shuffle.partitions: the number of output partitions from aggregations using the SQL API

Depending on your use case, it may make sense to explicitly set partition counts for each operation.
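Both properties can be set at submit time; the values below are illustrative (a common starting point is 2–3x the cluster's total vCore count, so around 256 for 128 vCores), and the class and jar path are placeholders:

```shell
# Raise the default partition counts so Spark has enough tasks
# to keep all 128 vCores busy. Values here are a rough guess, not tuned.
spark-submit \
  --conf spark.default.parallelism=256 \
  --conf spark.sql.shuffle.partitions=256 \
  --class com.example.MyJob \
  gs://my-bucket/jars/my-job.jar
```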

Ben Sidhom