11

It seems that by default EMR deploys the Spark driver to one of the CORE nodes, leaving the MASTER node virtually unused. Is it possible to run the driver program on the MASTER node instead? I have experimented with the --deploy-mode argument to no avail.

Here is my instance groups JSON definition:

[
  {
    "InstanceGroupType": "MASTER",
    "InstanceCount": 1,
    "InstanceType": "m3.xlarge",
    "Name": "Spark Master"
  },
  {
    "InstanceGroupType": "CORE",
    "InstanceCount": 3,
    "InstanceType": "m3.xlarge",
    "Name": "Spark Executors"
  }
]

Here is my configurations JSON definition:

[
  {
    "Classification": "spark",
    "Properties": {
      "maximizeResourceAllocation": "true"
    },
    "Configurations": []
  },
  {
    "Classification": "spark-env",
    "Properties": {
    },
    "Configurations": [
      {
        "Classification": "export",
        "Properties": {
        },
        "Configurations": [
        ]
      }
    ]
  }
]

Here is my steps JSON definition:

[
  {
    "Name": "example",
    "Type": "SPARK",
    "Args": [
      "--class", "com.name.of.Class",
      "/home/hadoop/myjar-assembly-1.0.jar"
    ],
    "ActionOnFailure": "TERMINATE_CLUSTER"
  }
]

I am using aws emr create-cluster with --release-label emr-4.3.0.
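For reference, the full invocation looks roughly like this (the file names are placeholders for wherever the JSON above is saved):

aws emr create-cluster \
  --name "example-cluster" \
  --release-label emr-4.3.0 \
  --applications Name=Spark \
  --use-default-roles \
  --instance-groups file://instance-groups.json \
  --configurations file://configurations.json \
  --steps file://steps.json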

Landon Kuhn
    As far as I know, the answer is no. The master node's sole responsibility seems to be running YARN. – Glennie Helles Sindholt Feb 05 '16 at 06:30
  • I thought maybe I could get a slave to run the Spark master and an executor by setting spark.executor.instances higher than the number of nodes, but it didn't work. – Landon Kuhn Feb 05 '16 at 18:59
  • This is the nature of Spark on YARN. If you set the deploy mode to client, the driver will run on the master node and only a small application master will run on a slave node. Also, if you forgo maximizeResourceAllocation and specify exactly what you want for the driver, executors, and application master (basically squeezing the last one down), you can tune the cluster to your application's needs; a sketch follows these comments. You could even experiment with dynamic resource allocation: http://docs.aws.amazon.com/ElasticMapReduce/latest/ReleaseGuide/emr-spark-configure.html#spark-dynamic-allocation. – ChristopherB Feb 13 '16 at 02:16
  • Quite wasteful. – thebluephantom Feb 17 '19 at 13:04
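
As a rough illustration of ChristopherB's suggestion: dropping maximizeResourceAllocation and sizing the driver, executors, and application master explicitly could look like this in the spark-defaults classification (the memory values are placeholders, not recommendations):

[
  {
    "Classification": "spark-defaults",
    "Properties": {
      "spark.driver.memory": "8g",
      "spark.executor.memory": "4g",
      "spark.executor.instances": "3",
      "spark.yarn.am.memory": "1g"
    },
    "Configurations": []
  }
]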

2 Answers

1

Setting the location of the driver

With spark-submit, the flag --deploy-mode can be used to select the location of the driver.

Submitting applications in client mode is advantageous when you are debugging and wish to quickly see the output of your application. For applications in production, the best practice is to run the application in cluster mode. This mode offers you a guarantee that the driver is always available during application execution. However, if you do use client mode and you submit applications from outside your EMR cluster (such as locally, on a laptop), keep in mind that the driver is running outside your EMR cluster and there will be higher latency for driver-executor communication.

https://blogs.aws.amazon.com/bigdata/post/Tx578UTQUV7LRP/Submitting-User-Applications-with-spark-submit
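
To make the two modes concrete, here is a sketch of both invocations for the jar from the question, run from the master node (these are standard spark-submit options; on EMR the YARN master is already set in spark-defaults):

# client mode: the driver runs where spark-submit is invoked,
# i.e. on the master node if you submit from there
spark-submit --deploy-mode client \
  --class com.name.of.Class \
  /home/hadoop/myjar-assembly-1.0.jar

# cluster mode: the driver runs inside the YARN application master
# on one of the core nodes
spark-submit --deploy-mode cluster \
  --class com.name.of.Class \
  /home/hadoop/myjar-assembly-1.0.jar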

Pankaj Arora
1

I don't think it is a waste. When running Spark on EMR, the master node runs the YARN ResourceManager, the Livy server, and any other applications you selected. And if you run in client mode, the bulk of the driver program runs on the master node as well.

Note that the driver program can be heavier than the tasks on the executors, for example when it collects all results from all executors, in which case you need to allocate enough resources to your master node if that is where the driver program runs.
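
Since EMR steps are submitted from the master node, adding --deploy-mode client to the step arguments keeps the driver there. A sketch based on the question's step definition (the --driver-memory value is a placeholder):

[
  {
    "Name": "example",
    "Type": "SPARK",
    "Args": [
      "--deploy-mode", "client",
      "--driver-memory", "8g",
      "--class", "com.name.of.Class",
      "/home/hadoop/myjar-assembly-1.0.jar"
    ],
    "ActionOnFailure": "TERMINATE_CLUSTER"
  }
]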

Z.Wei