I have a Spark program that I'm running on EMR. I noticed that when I change my cluster to have more cores (by choosing an instance type with more cores), 1) the job takes considerably longer to complete, and 2) it never actually completes because it errors out.
Specifically, my job takes 4 minutes to complete when I use 19 c3.4xlarge slave nodes (so 304 cores), with 57 executors and 1140 partitions. But when I change to 20 c4.8xlarge slave nodes (so 720 cores), with 140 executors and 2800 partitions, it fails after 22 minutes.
Why is this happening? I would expect that increasing the number of cores (and, with it, the number of partitions) would speed the job up. Furthermore, I'm unsure why the second scenario fails at all.
In both cases, I have approximately 5 cores per executor and four times as many partitions as cores (assuming one core per node is reserved for system tasks and the YARN agents).
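For reference, here is my back-of-the-envelope arithmetic behind those numbers (assuming 16 vCPUs per c3.4xlarge and 36 per c4.8xlarge, with one core per node set aside):

19 * (16 - 1) = 285 usable cores  ->  285 / 5 = 57 executors,  285 * 4 = 1140 partitions
20 * (36 - 1) = 700 usable cores  ->  700 / 5 = 140 executors,  700 * 4 = 2800 partitions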
Here is my spark-submit as requested in the comment below:
spark-submit --deploy-mode client --master yarn --num-executors 57 --class someMainClass /path/to/local/JAR
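Note that I don't pass --executor-cores or --executor-memory explicitly, so those fall back to the cluster defaults. In case it matters, something like the following is roughly what I had in mind for making the sizing explicit in the second scenario (the memory value here is illustrative, not what I actually ran):

spark-submit --deploy-mode client --master yarn --num-executors 140 --executor-cores 5 --executor-memory 10G --conf spark.default.parallelism=2800 --class someMainClass /path/to/local/JAR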