Oozie job won't run if using PySpark in SparkAction

Question

I've encountered several examples of SparkAction jobs in Oozie, and most of them are in Java. I edited one of them slightly and ran it in Cloudera CDH Quickstart 5.4.0 (with Spark version 1.4.0).

workflow.xml

<workflow-app xmlns='uri:oozie:workflow:0.5' name='SparkFileCopy'>
    <start to='spark-node' />

    <action name='spark-node'>
        <spark xmlns="uri:oozie:spark-action:0.1">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <prepare>
                <delete path="${nameNode}/user/${wf:user()}/${examplesRoot}/output-data/spark"/>
            </prepare>
            <master>${master}</master>
            <mode>${mode}</mode>
            <name>Spark-FileCopy</name>
            <class>org.apache.oozie.example.SparkFileCopy</class>
            <jar>${nameNode}/user/${wf:user()}/${examplesRoot}/apps/spark/lib/oozie-examples.jar</jar>
            <arg>${nameNode}/user/${wf:user()}/${examplesRoot}/input-data/text/data.txt</arg>
            <arg>${nameNode}/user/${wf:user()}/${examplesRoot}/output-data/spark</arg>
        </spark>
        <ok to="end" />
        <error to="fail" />
    </action>

    <kill name="fail">
        <message>Workflow failed, error
            message[${wf:errorMessage(wf:lastErrorNode())}]
        </message>
    </kill>
    <end name='end' />
</workflow-app>

job.properties

nameNode=hdfs://quickstart.cloudera:8020
jobTracker=quickstart.cloudera:8032
master=local[2]
mode=client
examplesRoot=examples
oozie.use.system.libpath=true
oozie.wf.application.path=${nameNode}/user/${user.name}/${examplesRoot}/apps/spark

The Oozie workflow example (in Java) was able to complete and do its task.

However, my own spark-submit job is written in Python / PySpark. I tried removing <class> and pointing <jar> at the script:

<jar>my_pyspark_job.py</jar>

but I get this error in the logs when I attempt to run the Oozie-Spark job:

Launcher ERROR, reason: Main class [org.apache.oozie.action.hadoop.SparkMain], exit code [2]

What should I put in the <class> and <jar> tags if I'm using Python / PySpark?

score 6 · Answer 1 · answered Oct 13 '15 at 05:22

I too struggled a lot with the spark-action in Oozie. I set up the sharelib properly and tried to pass the appropriate jars using the --jars option within the <spark-opts> </spark-opts> tags, but to no avail.

I always ended up getting one error or another. The most I could do was run all Java/Python Spark jobs in local mode through the spark-action.

However, I got all my spark jobs running in oozie in all modes of execution using the shell action. The major problem with the shell action is that shell jobs are deployed as the 'yarn' user. If you happen to deploy your oozie spark job from a user account other than yarn, you'll end up with a Permission Denied error (because the user would not be able to access the spark assembly jar copied into /user/yarn/.SparkStaging directory). The way to solve this is to set the HADOOP_USER_NAME environment variable to the user account name through which you deploy your oozie workflow.

Below is a workflow that illustrates this configuration. I deploy my oozie workflows from the ambari-qa user.

<workflow-app xmlns="uri:oozie:workflow:0.4" name="sparkjob">
    <start to="spark-shell-node"/>
    <action name="spark-shell-node">
        <shell xmlns="uri:oozie:shell-action:0.2">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <configuration>
                <property>
                    <name>oozie.launcher.mapred.job.queue.name</name>
                    <value>launcher2</value>
                </property>
                <property>
                    <name>mapred.job.queue.name</name>
                    <value>default</value>
                </property>
                <property>
                    <name>oozie.hive.defaults</name>
                    <value>/user/ambari-qa/sparkActionPython/hive-site.xml</value>
                </property>
            </configuration>
            <exec>/usr/hdp/current/spark-client/bin/spark-submit</exec>
            <argument>--master</argument>
            <argument>yarn-cluster</argument>
            <argument>wordcount.py</argument>
            <env-var>HADOOP_USER_NAME=ambari-qa</env-var>
            <file>/user/ambari-qa/sparkActionPython/wordcount.py#wordcount.py</file>
            <capture-output/>
        </shell>
        <ok to="end"/>
        <error to="spark-fail"/>
    </action>
    <kill name="spark-fail">
        <message>Shell action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name="end"/>
</workflow-app>
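To try the same submission by hand before wiring it into Oozie, the <exec>, <argument> and <env-var> entries of the shell action can be mirrored in a few lines of Python. This is only a sketch: the spark-submit path, the yarn-cluster master and the ambari-qa user are copied from the workflow above, while build_spark_submit and run_spark_submit are hypothetical helper names, not part of Oozie or Spark.

```python
import os
import subprocess

# Path copied from the <exec> element above; it differs between distributions.
SPARK_SUBMIT = "/usr/hdp/current/spark-client/bin/spark-submit"


def build_spark_submit(script, user, master="yarn-cluster"):
    """Return (argv, env) mirroring the <exec>, <argument> and <env-var> entries."""
    argv = [SPARK_SUBMIT, "--master", master, script]
    env = dict(os.environ)
    # Same intent as <env-var>HADOOP_USER_NAME=...</env-var> in the shell action.
    env["HADOOP_USER_NAME"] = user
    return argv, env


def run_spark_submit(script, user, master="yarn-cluster"):
    """Run spark-submit with the environment above and return its exit code."""
    argv, env = build_spark_submit(script, user, master)
    return subprocess.call(argv, env=env)
```

Calling run_spark_submit("wordcount.py", "ambari-qa") on a node that has the Spark client installed should behave roughly like the shell action, which makes it easier to separate Spark problems from Oozie problems.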

Hope this helps!

sashaostr · Answer 2 · 2015-09-02T06:40:07.503

Try configuring the Oozie Spark action to bring the needed files to the local node. You can do this with a <file> tag:

<spark xmlns="uri:oozie:spark-action:0.1">
    <job-tracker>${resourceManager}</job-tracker>
    <name-node>${nameNode}</name-node>
    <master>local[2]</master>
    <mode>client</mode>
    <name>${name}</name>
    <jar>my_pyspark_job.py</jar>
    <file>{path to your file on hdfs}/my_pyspark_job.py#my_pyspark_job.py</file>
</spark>

Explanation: An Oozie action runs inside a YARN container, which YARN allocates on a node that has available resources. Before running the action (which is actually the "driver" code), it copies all needed files (jars, for example) locally to the node, into the folder allocated for the YARN container's resources. So by adding the <file> tag to the Oozie action, you are "telling" the action to bring my_pyspark_job.py locally to the node of execution.
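To make the path#name form of the <file> value concrete, here is a small sketch of how I understand the naming to work. It assumes the usual Hadoop distributed-cache convention: the part after # becomes the name of the file in the container's working directory, and the base name of the path is used when there is no # part. local_name is a hypothetical helper, not an Oozie API.

```python
import posixpath
from urllib.parse import urlsplit


def local_name(file_spec):
    """Return the name a <file> entry is expected to get in the working directory."""
    parts = urlsplit(file_spec)
    if parts.fragment:
        # Text after '#' is the name the action sees locally.
        return parts.fragment
    # Assumption: with no '#', the base name of the HDFS path is used.
    return posixpath.basename(parts.path)
```

With this reading, the name you put in <jar> has to match what local_name returns for the corresponding <file> entry.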

In my case I want to run a bash script (run-hive-partitioner.bash) that runs Python code (hive-generic-partitioner.py), so I need all the files to be locally accessible on the node:

<action name="repair_hive_partitions">
  <shell xmlns="uri:oozie:shell-action:0.1">
    <job-tracker>${jobTracker}</job-tracker>
    <name-node>${nameNode}</name-node>
    <exec>${appPath}/run-hive-partitioner.bash</exec>
    <argument>${db}</argument>
    <argument>${tables}</argument>
    <argument>${base_working_dir}</argument>
    <file>${appPath}/run-hive-partitioner.bash#run-hive-partitioner.bash</file>
    <file>${appPath}/hive-generic-partitioner.py#hive-generic-partitioner.py</file>
    <file>${appPath}/util.py#util.py</file>
  </shell>
  <ok to="end"/>
  <error to="kill"/>
</action>

where ${appPath} is hdfs://ci-base.com:8020/app/oozie/util/wf-repair_hive_partitions

so this is what I get in my job:

Files in current dir:/hadoop/yarn/local/usercache/hdfs/appcache/application_1440506439954_3906/container_1440506439954_3906_01_000002/

======================
File: hive-generic-partitioner.py
File: util.py
File: run-hive-partitioner.bash
...
File: job.xml
File: json-simple-1.1.jar
File: oozie-sharelib-oozie-4.1.0.2.2.4.2-2.jar
File: launch_container.sh
File: oozie-hadoop-utils-2.6.0.2.2.4.2-2.oozie-4.1.0.2.2.4.2-2.jar

As you can see, Oozie (or actually YARN, I think) shipped all the needed files locally to the temp folder, and now it's able to run them.
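If you want a similar listing from your own job, something like the sketch below at the top of the Python script is enough to confirm which files were shipped. The output format only imitates the listing above, and describe_working_dir is a hypothetical helper name; I don't know what the original script used to print it.

```python
import os


def describe_working_dir(path="."):
    """Return lines describing the regular files in `path`, sorted by name."""
    abs_path = os.path.abspath(path)
    lines = ["Files in current dir:" + abs_path + "/", "======================"]
    for name in sorted(os.listdir(abs_path)):
        # isfile() follows symlinks, so linked files are listed as well.
        if os.path.isfile(os.path.join(abs_path, name)):
            lines.append("File: " + name)
    return lines


if __name__ == "__main__":
    print("\n".join(describe_working_dir()))
```

The output ends up in the stdout log of the action's container, next to whatever error the job itself reports.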

Hi, the file tag is not working in the spark action. Could you please let me know how to bring files locally in a spark action? - Ram Manohar, Apr 07 '16 at 20:12
The file tag works only with Oozie 4.3.0, and you should use uri:oozie:spark-action:0.2. Check the documentation: https://oozie.apache.org/docs/4.3.0/DG_SparkActionExtension.html#Spark_Action_Schema_Version_0.2 - Sigrist, Jan 26 '17 at 12:19

score 0 · Answer 3 · answered Jul 17 '15 at 09:06

I was able to "fix" this issue although it leads to another issue. Nonetheless, I will still post it.

In stderr of the Oozie container logs, it shows:

Error: Only local python files are supported

And I found a solution here

This is my initial workflow.xml:

    <spark xmlns="uri:oozie:spark-action:0.1">
        <job-tracker>${resourceManager}</job-tracker>
        <name-node>${nameNode}</name-node>
        <master>local[2]</master>
        <mode>client</mode>
        <name>${name}</name>
        <jar>my_pyspark_job.py</jar>
    </spark>

What I did initially was copy the Python script I wanted to run as a spark-submit job to HDFS. It turns out that Spark expects the .py script on the local file system, so I referred to the absolute local file system path of my script instead:

<jar>/<absolute-local-path>/my_pyspark_job.py</jar>
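My reading of the "Only local python files are supported" message is that the script is rejected because its URI scheme is not a local one. The sketch below is a rough pre-flight check under that assumption. The set of schemes treated as local is my guess and is not copied from the Spark source, and is_local_resource is a hypothetical helper name.

```python
from urllib.parse import urlsplit

# Assumption: these schemes (or no scheme at all) count as "local".
LOCAL_SCHEMES = {"", "file", "local"}


def is_local_resource(path):
    """Return True if `path` has no URI scheme or a local one."""
    return urlsplit(path).scheme in LOCAL_SCHEMES
```

A value such as hdfs://quickstart.cloudera:8020/user/cloudera/my_pyspark_job.py fails this check, while a bare file name or an absolute local path passes it.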

score 0 · Answer 4 · answered Dec 01 '15 at 20:50

We were getting the same error. If you copy the spark-assembly jar from '/path/to/spark-install/lib/spark-assembly*.jar' (the location depends on your distribution) into the lib directory under your oozie.wf.application.path, alongside your application jar, it should work.

nir
