EMR Cluster utilization

Asked Dec 20 '18 at 23:37

I have a 20-node c4.4xlarge cluster to run a Spark job. Each node has 16 vCores, 30 GiB of memory, and EBS-only storage (32 GiB).

Since each node has 16 vCores, I understand that the maximum number of executors is 16 * 20 = 320. Total memory available is 20 (nodes) * 30 GiB = 600 GiB. Setting aside 1/3 for system operations, I have about 400 GiB of memory to process my data in memory. Is this the right understanding?
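The arithmetic above can be written out as a minimal sketch. The constant names are made up for illustration, and the 1/3 system reserve is my own assumption rather than a Spark or YARN default. Note that 320 is only an upper bound, reached when every executor is given a single vCore.

```python
# Cluster sizing arithmetic from the question.
# Assumption: 1/3 of memory is set aside for system operations
# (my own estimate, not a Spark or YARN default).
NODES = 20
VCORES_PER_NODE = 16
MEM_GIB_PER_NODE = 30
SYSTEM_RESERVE_FRACTION = 1 / 3

# Upper bound on executors: one executor per vCore.
total_vcores = NODES * VCORES_PER_NODE

# Raw memory across the cluster, then what is left after the reserve.
total_mem_gib = NODES * MEM_GIB_PER_NODE
usable_mem_gib = total_mem_gib * (1 - SYSTEM_RESERVE_FRACTION)

print(total_vcores)           # 320
print(total_mem_gib)          # 600
print(round(usable_mem_gib))  # 400
```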

Also, the Spark History UI shows a non-uniform distribution of input and shuffle. I believe the processing is not distributed evenly across executors. I pass these config parameters to spark-submit:

> --conf spark.dynamicAllocation.enabled=true --conf spark.dynamicAllocation.minExecutors=20

The executor summary in the Spark History UI also shows that the load is heavily skewed, so I am not using the cluster in the best way. How can I distribute the load more evenly?
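One way to put a number on the skew described above is the ratio of the busiest executor to the average. This is only a sketch: `skew_ratio` is a hypothetical helper, and the sample values are invented for illustration, not taken from my job.

```python
from statistics import mean


def skew_ratio(per_executor_values):
    """Return max / mean for a non-empty list; 1.0 means a perfectly even load."""
    avg = mean(per_executor_values)
    if avg == 0:
        # Every executor processed nothing, which is trivially even.
        return 1.0
    return max(per_executor_values) / avg


# Hypothetical shuffle-read sizes in MiB for four executors (made-up numbers).
sample_shuffle_read_mib = [100, 100, 100, 700]

ratio = skew_ratio(sample_shuffle_read_mib)
print(ratio)  # 2.8
```

A ratio well above 1 means a few executors are doing most of the work while the rest sit idle.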

Abhi

This was observed while running a single job on the cluster. However, using the Fair Scheduler has increased overall parallelism and cluster utilization, giving me better throughput. – Abhi Jan 09 '19 at 20:44

0 Answers