
When you pick a more performant node type, say an r3.xlarge vs. an m3.xlarge, will Spark automatically utilize the additional resources, or is this something you need to manually configure and tune?

As far as configuration goes, which are the most important values to tune to get the most out of your cluster?

flybonzai

1 Answer


It will try.

AWS has a setting you can enable in your EMR cluster configuration that will attempt to do this: spark.dynamicAllocation.enabled. In the past there were issues with this setting where it would give too many resources to Spark; in newer releases the amount given to Spark has been lowered. However, if you are using PySpark, it will not take Python's resource requirements into account.
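For reference, here is a minimal sketch of what toggling that setting looks like from PySpark code (dynamic allocation on YARN also needs the external shuffle service; the app name is just a placeholder):

    from pyspark import SparkConf, SparkContext

    # Illustrative only: enable dynamic allocation explicitly.
    # On YARN this also requires the external shuffle service.
    conf = (SparkConf()
            .setAppName("dynamic-allocation-example")      # placeholder name
            .set("spark.dynamicAllocation.enabled", "true")
            .set("spark.shuffle.service.enabled", "true"))

    sc = SparkContext(conf=conf)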

I typically disable dynamicAllocation and set the appropriate memory and cores settings dynamically from my own code based upon what instance type is selected.
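A rough sketch of that approach, assuming a hand-maintained table of instance specs; the headroom numbers and app name are illustrative, not recommendations:

    from pyspark import SparkConf, SparkContext

    # Per-instance-type sizing table (vCPUs, RAM in GiB) -- approximate, for illustration.
    INSTANCE_SPECS = {
        "m3.xlarge": {"cores": 4, "ram_gb": 15},
        "r3.xlarge": {"cores": 4, "ram_gb": 30.5},
    }

    def build_conf(instance_type):
        spec = INSTANCE_SPECS[instance_type]
        # Leave one core and a few GiB for the OS / Python workers; adjust to taste.
        executor_cores = spec["cores"] - 1
        executor_mem_gb = int(spec["ram_gb"] - 3)
        return (SparkConf()
                .setAppName("manual-sizing-example")            # placeholder name
                .set("spark.dynamicAllocation.enabled", "false")
                .set("spark.executor.cores", str(executor_cores))
                .set("spark.executor.memory", "%dg" % executor_mem_gb))

    sc = SparkContext(conf=build_conf("r3.xlarge"))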

This page discusses what defaults they will select for you: http://docs.aws.amazon.com/ElasticMapReduce/latest/ReleaseGuide/emr-spark-configure.html

If you do it manually, at a minimum you will want to set:

spark.executor.memory
spark.executor.cores

Also, you may need to adjust the YARN container size limits with:

yarn.scheduler.maximum-allocation-mb
yarn.scheduler.minimum-allocation-mb
yarn.nodemanager.resource.memory-mb

Make sure you leave a core and some RAM for the OS, and RAM for Python if you are using PySpark.
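As a rough illustration (the numbers are approximate, not EMR's published defaults): an m3.xlarge has 4 vCPUs and about 15 GiB of RAM, so you might give each executor 3 cores, keep yarn.nodemanager.resource.memory-mb a few GiB below the 15 GiB total, and then set spark.executor.memory low enough that executor memory plus the YARN memory overhead still fits inside the container limit, with extra headroom for the Python workers under PySpark.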

Ryan Widmaier
  • To clarify, dynamic allocation is a property and function of Spark itself. When using PySpark, and depending on what you need from Python (the interpreter runs outside the JVM heap), it may be necessary to increase spark.yarn.[driver|executor].memoryOverhead (http://spark.apache.org/docs/latest/running-on-yarn.html#configuration). Also, I highly recommend not adjusting the scheduler and nodemanager resources, as exceeding the defaults runs the risk of oversubscribing the memory. – ChristopherB Nov 09 '16 at 12:38
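For reference, a minimal sketch of raising the overhead setting the comment refers to; the 2048 MiB figure and app name are purely illustrative:

    from pyspark import SparkConf, SparkContext

    # Illustrative only: reserve extra off-heap room for the Python workers.
    conf = (SparkConf()
            .setAppName("memory-overhead-example")                # placeholder name
            .set("spark.yarn.executor.memoryOverhead", "2048")    # MiB, example value
            .set("spark.yarn.driver.memoryOverhead", "2048"))     # MiB, example value

    sc = SparkContext(conf=conf)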