
So I'm trying to run a Spark pipeline on EMR, and I'm creating a step like so:

// Build the Spark job submission request
val runSparkJob = new StepConfig()
  .withName("Run Pipeline")
  .withActionOnFailure(ActionOnFailure.TERMINATE_CLUSTER)
  .withHadoopJarStep(
    new HadoopJarStepConfig()
      .withJar(jarS3Path)
      .withMainClass("com.example.SparkApp")
  )

Problem is, when I run this, I encounter an exception like so:

org.apache.spark.SparkException: A master URL must be set in your configuration

The thing is, I'm trying to figure out where to specify the master URL, and I can't seem to find it. Do I specify it when setting up the pipeline run step or do I need to somehow get the master IP:port into the application and specify it in the main function?

Ram Ghadiyaram
alexgolec

2 Answers


You should specify it in your application when you create the SparkSession instance.

Example for a local run (Scala code):

// Configure a SparkSession for running locally:
// "local[*]" uses all available cores on the local machine
val sparkSessionBuilder = SparkSession
      .builder()
      .appName(getClass.getSimpleName)
      .master("local[*]")
      .config("spark.driver.host", "localhost") // bind the driver to localhost for the local run

You can find more information at jaceklaskowski.gitbooks.io or at spark.apache.org.

When you launch a cluster, you should specify a step with command-runner.jar and pass your jar in the args:

val runSparkJob = new StepConfig()
  .withName("Run Pipeline")
  .withActionOnFailure(ActionOnFailure.TERMINATE_CLUSTER)
  .withHadoopJarStep(
    new HadoopJarStepConfig()
      .withJar("command-runner.jar")   // EMR's command-runner invokes spark-submit for you
      .withArgs("spark-submit",
           "--deploy-mode", "cluster",
           "--driver-memory", "10G",
           "--class", "<your_class_to_run>",
           "s3://path_to_your_jar"))

See the Amazon EMR documentation: "To submit work to Spark using the AWS SDK for Java".
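
For reference, here is a minimal sketch of submitting that step to an already-running cluster with the AWS SDK for Java (the region and the j-XXXXXXXXXXXXX cluster ID are placeholders you would substitute):

import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduceClientBuilder
import com.amazonaws.services.elasticmapreduce.model.AddJobFlowStepsRequest

// The client picks up credentials from the default provider chain
val emr = AmazonElasticMapReduceClientBuilder.standard()
  .withRegion("us-east-1")            // placeholder region
  .build()

// Attach the command-runner step defined above to an existing cluster
val request = new AddJobFlowStepsRequest()
  .withJobFlowId("j-XXXXXXXXXXXXX")   // placeholder cluster ID
  .withSteps(runSparkJob)

val stepIds = emr.addJobFlowSteps(request).getStepIds
println(s"Submitted step ids: $stepIds")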

Kamrus

Within your Spark application you can do the below; this is option 1:

// Option 1: hard-code the YARN master when building the SparkSession
val sparkSessionBuilder = SparkSession
      .builder()
      .appName(getClass.getSimpleName)
      .master("yarn")

If you want to add it to the StepConfig instead, that is option 2:

// Define the Spark application step
HadoopJarStepConfig sparkConfig = new HadoopJarStepConfig()
    .withJar("command-runner.jar")
    .withArgs("spark-submit", "--deploy-mode", "cluster", "--master", "yarn",
              "--class", "com.amazonaws.samples.TestQuery",
              "s3://20180205-kh-emr-01/jar/emrtest.jar", "10", "Step Test"); // optional list of application arguments

StepConfig customStep = new StepConfig()
    .withHadoopJarStep(sparkConfig)
    .withName("SparkSQL");

I prefer option 2, since the master is then not hard-coded in the application.
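
With option 2 the application itself stays master-agnostic; a minimal sketch of what the submitted class can look like (the object name and the trivial count are illustrative only):

import org.apache.spark.sql.SparkSession

object TestQuery {
  def main(args: Array[String]): Unit = {
    // No .master() here: spark-submit (via command-runner.jar) supplies --master yarn,
    // and the same jar can be tested locally with spark-submit --master local[*]
    val spark = SparkSession
      .builder()
      .appName("TestQuery")
      .getOrCreate()

    // Trivial job to show the session works; replace with your real logic
    println(spark.range(100).count())

    spark.stop()
  }
}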

Ram Ghadiyaram