
I am using EMR 5.0 with Spark 2.0.0. I am trying to launch a child Spark application from a Scala Spark application using org.apache.spark.launcher.SparkLauncher.

I need to set SPARK_HOME using setSparkHome:

 val handle = new SparkLauncher()
    .setAppResource("s3://my-bucket/python_code.py")
    .setAppName("PythonAPP")
    .setMaster("spark://" + sparkSession.conf.get("spark.driver.host") + ":" + sparkSession.conf.get("spark.driver.port"))
    .setVerbose(true)
    .setConf(SparkLauncher.EXECUTOR_CORES, "1")
    .setSparkHome("/srv/spark") // not working
    .setDeployMode("client")
    .startApplication(
      new SparkAppHandle.Listener() {

        override def infoChanged(handle: SparkAppHandle): Unit = {
          println(handle.getState() + " (info changed)")
        }

        override def stateChanged(handle: SparkAppHandle): Unit = {
          println(handle.getState() + " (state changed)")
        }
      })

Where can I find the appropriate path for my Spark home? The cluster is built from 1 master, 1 core, and 1 task node.

Thanks!

Ulile

2 Answers


As of emr-4.0.0, all applications on EMR are in /usr/lib. Spark is in /usr/lib/spark.
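
As a minimal sketch (assuming the spark-launcher artifact is on the classpath; the environment-variable fallback is my own addition, not something EMR requires), the path can be resolved once and handed to the launcher:

    import org.apache.spark.launcher.SparkLauncher

    // Sketch: prefer an explicitly exported SPARK_HOME, falling back to
    // the EMR install path named in this answer.
    val sparkHome = sys.env.getOrElse("SPARK_HOME", "/usr/lib/spark")
    val launcher = new SparkLauncher().setSparkHome(sparkHome)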

Jonathan Kelly
  • Thanks, but I still get the same error: `16/09/18 09:07:02 ERROR ApplicationMaster: User class threw exception: java.io.IOException: Cannot run program "/usr/lib/spark/bin/spark-submit": error=2, No such file or directory java.io.IOException: Cannot run program "/usr/lib/spark/bin/spark-submit": error=2, No such file or directory at java.lang.ProcessBuilder.start(ProcessBuilder.java:1048) at org.apache.spark.launcher.SparkLauncher.startApplication(SparkLauncher.java:428)` – Ulile Sep 18 '16 at 09:19
  • Are you not running this on the master instance? If Spark is installed, /usr/lib/spark definitely exists on the master, but it does not exist on the other nodes. – Jonathan Kelly Sep 18 '16 at 17:42
  • Btw, I also noticed that you set the Spark master to spark://..., but that is not correct for Spark on EMR, since it runs on YARN. The correct Spark master is just "yarn" (see the sketch after these comments). – Jonathan Kelly Sep 18 '16 at 17:44
  • So if I am setting master=yarn, how can I run it on the master? – Ulile Sep 20 '16 at 12:53
  • Sorry, I'm not sure exactly what you are asking. You are running on a cluster that has Spark installed, but you claim that you still get an error saying that /usr/lib/spark/bin/spark-submit does not exist, but that's not true if you are running the command on the master instance. If that file doesn't exist, perhaps you are somehow running on the wrong instance? I'm not sure how that would be the case, but you haven't really provided enough information about how you are running this. – Jonathan Kelly Oct 26 '16 at 17:18
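
Putting the two corrections from this thread together (the /usr/lib/spark install path and "yarn" as the master), a minimal sketch of the corrected launcher call could look like the following; the S3 resource and app name are the question's own placeholders:

    import org.apache.spark.launcher.{SparkAppHandle, SparkLauncher}

    // Sketch of the question's launcher with both fixes applied:
    // the EMR Spark home and "yarn" instead of spark://host:port.
    val handle = new SparkLauncher()
      .setAppResource("s3://my-bucket/python_code.py")
      .setAppName("PythonAPP")
      .setMaster("yarn")              // Spark on EMR runs on YARN
      .setDeployMode("client")
      .setSparkHome("/usr/lib/spark") // EMR install path
      .setConf(SparkLauncher.EXECUTOR_CORES, "1")
      .setVerbose(true)
      .startApplication(new SparkAppHandle.Listener() {
        override def stateChanged(handle: SparkAppHandle): Unit =
          println("state changed: " + handle.getState())
        override def infoChanged(handle: SparkAppHandle): Unit =
          println("info changed: " + handle.getState())
      })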

I found out that Spark on AWS EMR (tested with emr-5.23.0 and emr-5.22.0) is not installed on the EMR core nodes. Just check /usr/lib/spark on those nodes: it's not really a SPARK_HOME like the one installed on the EMR master node.

Installing Spark on the EMR core nodes solved my issue.
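
Before changing the cluster, a quick sanity check (a sketch, assuming your driver code can run on the node in question) is to look for the spark-submit binary that the launcher will try to execute; its absence is exactly what produced the "Cannot run program" error quoted in the comments above:

    import java.nio.file.{Files, Paths}

    // Sketch: SparkLauncher ultimately execs $SPARK_HOME/bin/spark-submit,
    // so verify it exists and is executable on this node first.
    val sparkSubmit = Paths.get("/usr/lib/spark", "bin", "spark-submit")
    if (Files.isExecutable(sparkSubmit))
      println("Found usable Spark install: " + sparkSubmit)
    else
      println("Missing " + sparkSubmit + ": likely a core/task node without a full Spark install")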

nsphung
  • I am having the same issue right now, but is it really a solution to install Spark on EMR Core nodes, though? It sounds like a complete hack. I am not sure why the Core nodes don't have the same Spark setup as the Master. – mj3c May 06 '20 at 13:11
  • Well, I don't know why, but it makes things work, at least. You'd better ask someone on AWS EMR support. I haven't tested on a recent version of EMR, however; maybe it's fixed now. – nsphung May 06 '20 at 13:25