I'm quite new to configuring Spark, so I wanted to know whether I am fully utilising my EMR cluster. The cluster runs Spark 2.4 and Hadoop 2.8.5.

The app reads a large number of small gzipped JSON files from S3, transforms the data, and writes the results back out to S3. The job is roughly shaped like the sketch below.
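For context, here is a minimal sketch of what the app does (the real class is com.example.app; the bucket paths, schema, and transform below are placeholders, not my actual code):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object ExampleApp {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("example-app").getOrCreate()

    // Spark decompresses .json.gz transparently, but gzip is not splittable,
    // so each small file is read whole as a single input partition
    val raw = spark.read.json("s3://my-input-bucket/events/*.json.gz")

    // placeholder transform
    val out = raw
      .filter(col("status") === "active")
      .withColumn("processed_at", current_timestamp())

    out.write.mode("overwrite").json("s3://my-output-bucket/processed/")
    spark.stop()
  }
}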

I've read various articles, but I was hoping someone could double-check my configuration in case any of the settings conflict with each other.

The cluster is made up of three c4.8xlarge worker nodes, each with 36 vCPU cores and 60 GB of RAM, so 108 cores and 180 GB of RAM in total.

Here are the spark-submit settings that I paste into the EMR "add step" box:

--class com.example.app
--master yarn
--driver-memory 12g
--executor-memory 3g
--executor-cores 3
--num-executors 33
--conf spark.executor.memory=5g
--conf spark.executor.cores=3
--conf spark.executor.instances=33
--conf spark.driver.cores=16
--conf spark.driver.memory=12g
--conf spark.default.parallelism=200
--conf spark.sql.shuffle.partitions=500
--conf spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2 
--conf spark.speculation=false
--conf spark.yarn.am.memory=1g
--conf spark.executor.heartbeatInterval=360000
--conf spark.network.timeout=420000
--conf spark.hadoop.fs.hdfs.impl.disable.cache=true
--conf spark.kryoserializer.buffer.max=512m
--conf spark.shuffle.consolidateFiles=true
--conf spark.hadoop.fs.s3a.multiobjectdelete.enable=false
--conf spark.hadoop.fs.s3a.fast.upload=true
--conf spark.worker.instances=3
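I noticed I'm setting some values twice (for example --executor-memory 3g alongside spark.executor.memory=5g), so to see which values actually take effect I've been printing the resolved configuration from inside the job. A small sketch, assuming spark is the session my app already creates:

// dump every resolved spark.executor.* property, e.g. to confirm whether
// --executor-memory 3g or spark.executor.memory=5g won
val executorSettings = spark.sparkContext.getConf.getAll
  .filter { case (key, _) => key.startsWith("spark.executor") }

executorSettings.foreach { case (key, value) => println(s"$key=$value") }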