I am getting the below error while trying to execute spark submit using oozie on emr

Question

I am running on cluster mode. The apacheds-kerberos-codec-2.0.0-M15.jar is present in multiple places in oozie/share/lib/lib*/spark and oozie/share/lib/lib*/oozie. Is this an environmental issue ?

ava.lang.IllegalArgumentException: Attempt to add (hdfs://ip-172-20-10-53.ec2.internal:8020/user/oozie/share/lib/lib_20170208121307/oozie/apacheds-kerberos-codec-2.0.0-M15.jar) multiple times to the distributed cache.
    at org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$11$$anonfun$apply$8.apply(Client.scala:608)
    at org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$11$$anonfun$apply$8.apply(Client.scala:599)
    at scala.collection.mutable.ArraySeq.foreach(ArraySeq.scala:74)
    at org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$11.apply(Client.scala:599)
    at org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$11.apply(Client.scala:598)
    at scala.collection.immutable.List.foreach(List.scala:381)
    at org.apache.spark.deploy.yarn.Client.prepareLocalResources(Client.scala:598)
    at org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:868)
    at org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:170)
    at org.apache.spark.deploy.yarn.Client.run(Client.scala:1154)
    at org.apache.spark.deploy.yarn.Client$.main(Client.scala:1213)
    at org.apache.spark.deploy.yarn.Client.main(Client.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:738)
    at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:187)
    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:212)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:126)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
    at org.apache.oozie.action.hadoop.SparkMain.runSpark(SparkMain.java:338)
    at org.apache.oozie.action.hadoop.SparkMain.run(SparkMain.java:257)
    at org.apache.oozie.action.hadoop.LauncherMain.run(LauncherMain.java:60)
    at org.apache.oozie.action.hadoop.SparkMain.main(SparkMain.java:78)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.oozie.action.hadoop.LauncherMapper.map(LauncherMapper.java:232)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:455)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:344)
    at org.apache.hadoop.mapred.LocalContainerLauncher$EventHandler.runSubtask(LocalContainerLauncher.java:380)
    at org.apache.hadoop.mapred.LocalContainerLauncher$EventHandler.runTask(LocalContainerLauncher.java:301)
    at org.apache.hadoop.mapred.LocalContainerLauncher$EventHandler.access$200(LocalContainerLauncher.java:187)
    at org.apache.hadoop.mapred.LocalContainerLauncher$EventHandler$1.run(LocalContainerLauncher.java:230)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

Welcome to Stack Overflow! You can take the [tour] first and learn [ask] a good question and create a [mcve]. That makes it easier for us to help you. — Katie, Feb 08 '17 at 15:34
Are you using Hadoop-3 release code base. Hadoop 3 no longer allows duplicate filenames in the distributed cache, even if they're the same file. — YoungHobbit, Feb 09 '17 at 03:34

score 0 · Accepted Answer · answered Feb 22 '17 at 16:24

It appears that the oozie sharelib and the spark sharelib directory share the same jars, and running a spark workflow imports both directories, which hadoop-3 code base doesn't like.

I've had to reorganize the oozie sharelib directory to only have oozie specific jars such that there are no duplicates between both oozie and spark sharelib dirs:

export HADOOP_USER_NAME=oozie
hdfs dfs -mv /user/oozie/share/lib/lib_20170222143042/oozie /user/oozie/share/lib/lib_20170222143042/oozie.old
hdfs dfs -mkdir /user/oozie/share/lib/lib_20170222143042/oozie
hdfs dfs -cp /user/oozie/share/lib/lib_20170222143042/oozie.old/oozie-hadoop-utils-hadoop-2-4.3.0.jar /user/oozie/share/lib/lib_20170222143042/oozie
hdfs dfs -cp /user/oozie/share/lib/lib_20170222143042/oozie.old/oozie-sharelib-oozie-4.3.0.jar /user/oozie/share/lib/lib_20170222143042/oozie

This fixes the immediate issue of being able to run spark workflows from oozie, but I'm not sure if this affects non-spark workflows.

score 0 · Answer 2 · answered Jun 08 '17 at 12:53

I have an oozie job that starts a spark job, running in Amazon EMR. I got the same error when the EMR Hadoop setup has one instance for the master and one instance for the slaves. When I increased the amount of instances for the slaves to two, everything worked as expected.

I am getting the below error while trying to execute spark submit using oozie on emr

2 Answers2