This issue is a continuation of my previous question here, which was seemingly resolved but led to this new issue.
I am using Spark 1.4.0 on Cloudera QuickstartVM CDH-5.4.0. When I run my PySpark script as a SparkAction in Oozie, I encounter this error in the Oozie job / container logs:
KeyError: 'SPARK_HOME'
Then I came across this solution and this one, which are actually for Spark 1.3.0, although I still tried them. The documentation seems to say that this issue was already fixed for Spark 1.3.2 and 1.4.0 (and yet here I am, encountering the same issue).
The suggested solution in the link was that I need to set spark.yarn.appMasterEnv.SPARK_HOME and spark.executorEnv.SPARK_HOME to anything, even a path that does not point to the actual SPARK_HOME (e.g., /bogus), although I did set these to the actual SPARK_HOME.
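For reference, outside Oozie I believe these properties would be passed to spark-submit directly like this (just a sketch of my understanding; the path assumes the QuickstartVM's /usr/lib/spark layout, and my_pyspark_job.py stands in for the actual script):

spark-submit \
    --master local[2] \
    --conf spark.yarn.appMasterEnv.SPARK_HOME=/usr/lib/spark \
    --conf spark.executorEnv.SPARK_HOME=/usr/lib/spark \
    my_pyspark_job.py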
Here's my workflow after applying the suggested solution:
<spark xmlns="uri:oozie:spark-action:0.1">
    <job-tracker>${resourceManager}</job-tracker>
    <name-node>${nameNode}</name-node>
    <master>local[2]</master>
    <mode>client</mode>
    <name>${name}</name>
    <jar>${workflowRootLocal}/lib/my_pyspark_job.py</jar>
    <spark-opts>--conf spark.yarn.appMasterEnv.SPARK_HOME=/usr/lib/spark spark.executorEnv.SPARK_HOME=/usr/lib/spark</spark-opts>
</spark>
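As far as I understand, Oozie splits the contents of <spark-opts> on whitespace and appends the tokens to the spark-submit invocation it builds, so I assume the effective command looks roughly like this (my reconstruction, not an actual log):

spark-submit \
    --master local[2] \
    --deploy-mode client \
    --name ${name} \
    --conf spark.yarn.appMasterEnv.SPARK_HOME=/usr/lib/spark \
    spark.executorEnv.SPARK_HOME=/usr/lib/spark \
    ${workflowRootLocal}/lib/my_pyspark_job.py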
This seems to solve the original problem above. However, it leads to another error when I inspect the stderr of the Oozie container log:
Error: Cannot load main class from JAR file:/var/lib/hadoop-yarn/cache/yarn/nm-local-dir/usercache/cloudera/appcache/application_1437103727449_0011/container_1437103727449_0011_01_000001/spark.executorEnv.SPARK_HOME=/usr/lib/spark
If I am using Python, it should not expect a main class, right? Please note from my previous related post that the Oozie example job shipped with Cloudera QuickstartVM CDH-5.4.0, which features a SparkAction written in Java, was working in my tests. It seems that the issue only occurs with Python.
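For context, the script itself is an ordinary PySpark program with no main class; a simplified sketch of what it looks like (not my exact job):

from pyspark import SparkConf, SparkContext

# Plain PySpark script: there is no JVM main class to load here.
conf = SparkConf().setAppName("my_pyspark_job")
sc = SparkContext(conf=conf)

# Trivial job just to exercise the cluster.
print(sc.parallelize(range(100)).sum())

sc.stop()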
Any help would be greatly appreciated.