I am trying to run a simple Java Spark job through Oozie on an EMR cluster. The job just takes files from an input path, applies a few basic transformations, and writes the result to a different output path.
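For context, the job is essentially the following (a minimal sketch; the class name and the transformation are placeholders rather than my real code):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class MyJob {
    public static void main(String[] args) {
        // args: [0] master hint (unused here, spark-submit sets the master),
        //       [1] input path, [2] output path
        SparkConf conf = new SparkConf().setAppName("my-simple-job");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            JavaRDD<String> lines = sc.textFile(args[1]);
            // "a few basic actions", e.g. dropping empty lines
            JavaRDD<String> result = lines.filter(line -> !line.isEmpty());
            result.saveAsTextFile(args[2]);
        }
    }
}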
When I run it from the command line using spark-submit, as shown below, it works fine:
spark-submit --class com.someClassName --master yarn --deploy-mode cluster /home/hadoop/some-local-path/my-jar-file.jar yarn s3n://input-path s3n://output-path
I then set up the same job in an Oozie workflow (the relevant part of my workflow.xml is sketched after the error below). However, when run from there, the job always fails. The stdout log contains this line:
Failing Oozie Launcher, Main class [org.apache.oozie.action.hadoop.SparkMain], main() threw exception, Attempt to add (hdfs://[emr-cluster]:8020/user/oozie/workflows/[WF-Name]/lib/[my-jar-file].jar) multiple times to the distributed cache.
java.lang.IllegalArgumentException: Attempt to add (hdfs://[emr-cluster]:8020/user/oozie/workflows/[WF-Name]/lib/[my-jar-file].jar) multiple times to the distributed cache.
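The Spark action in my workflow.xml looks roughly like this (a trimmed-down sketch; the workflow name, jar path, and argument values are placeholders for my actual ones):

<workflow-app xmlns="uri:oozie:workflow:0.5" name="my-spark-wf">
    <start to="spark-node"/>
    <action name="spark-node">
        <spark xmlns="uri:oozie:spark-action:0.1">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <master>yarn</master>
            <mode>cluster</mode>
            <name>my-spark-job</name>
            <class>com.someClassName</class>
            <jar>${nameNode}/user/oozie/workflows/my-wf/lib/my-jar-file.jar</jar>
            <arg>yarn</arg>
            <arg>s3n://input-path</arg>
            <arg>s3n://output-path</arg>
        </spark>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>Spark job failed: ${wf:errorMessage(wf:lastErrorNode())}</message>
    </kill>
    <end name="end"/>
</workflow-app>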
I found a KB note and another question here on StackOverflow that deal with a similar error. In both cases, however, the job was failing because of an internal JAR file, not the one the user passes in to run. Nonetheless, I tried the resolution steps they suggest, removing the jar files that Spark and Oozie have in common from the sharelib, and ended up deleting a few files from "/user/oozie/share/lib/lib_*/spark" (roughly the steps shown below). Unfortunately, that did not solve the problem either.
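These are approximately the commands I used for that cleanup (the timestamp and jar names are placeholders; on the real cluster the lib_* directory carries a timestamp, and oozie admin needs the server URL via -oozie or OOZIE_URL):

# list the Spark sharelib contents to spot jars duplicated in the workflow's lib/
hdfs dfs -ls /user/oozie/share/lib/lib_<timestamp>/spark

# remove the duplicates
hdfs dfs -rm /user/oozie/share/lib/lib_<timestamp>/spark/<duplicate-jar>.jar

# tell Oozie to pick up the modified sharelib
oozie admin -sharelibupdate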
Any ideas on how to debug this issue?