
The Oozie workflow launcher sometimes fails (KILLED status) because of the classpath loading order. SparkSubmit calls a method that exists in ivy 2.4.0 but not in ivy 2.0.0-rc2. The workflow usually runs fine (SUCCEEDED) for most hourly nominal times, but the launch infrequently fails because ivy 2.0 gets loaded instead of ivy 2.4. Upon failure, the (redacted) Oozie launcher log shows this stack trace:

2017-10-31 20:37:30,339 WARN org.apache.oozie.action.hadoop.SparkActionExecutor: SERVER[xxxx-oozie-lv-102.xxx.net] USER[xxxxx] GROUP[-] TOKEN[] APP[xxxx-proc-oozie] JOB[0143924-170929213137940-oozie-oozi-W] ACTION[0143924-170929213137940-oozie-oozi-W@xxxx] Launcher exception: org.apache.ivy.core.module.descriptor.DefaultModuleDescriptor.setDefaultConf(Ljava/lang/String;)V
java.lang.NoSuchMethodError: org.apache.ivy.core.module.descriptor.DefaultModuleDescriptor.setDefaultConf(Ljava/lang/String;)V
    at org.apache.spark.deploy.SparkSubmitUtils$.resolveMavenCoordinates(SparkSubmit.scala:1054)
    at org.apache.spark.deploy.SparkSubmit$.prepareSubmitEnvironment(SparkSubmit.scala:287)
    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:154)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
    at org.apache.oozie.action.hadoop.SparkMain.runSpark(SparkMain.java:264)
    at org.apache.oozie.action.hadoop.SparkMain.run(SparkMain.java:214)
    at org.apache.oozie.action.hadoop.LauncherMain.run(LauncherMain.java:60)
    at org.apache.oozie.action.hadoop.SparkMain.main(SparkMain.java:52)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.oozie.action.hadoop.LauncherMapper.map(LauncherMapper.java:233)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:453)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1912)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
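
One way to check which ivy jar a failed launcher container actually had available is to grep that launcher's aggregated YARN logs, roughly like this (the application id below is a placeholder; the real one shows in the Oozie console for the failed action):

# dump the failed Oozie launcher container's logs and look for ivy jar names
yarn logs -applicationId application_1509000000000_12345 | grep -i ivy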

It seems that the Cloudera Hadoop distribution (CDH) contains ivy 2.0.0-rc2, but its SparkSubmit seems to require ivy 2.4.0. I have tried including ivy 2.4 in my jar and excluding 2.0, but the failure happens before my process is even launched (so that was probably a bit of a long shot). I figure there must be a way to give the 2.4.0 version precedence in the Oozie loading process, and have tried setting oozie.launcher.mapreduce.user.classpath.first to both true and false. In any case, the job properties file does/must contain:

oozie.libpath=${nameNode}/user/spark/share/XXXX-spark/
oozie.use.system.libpath=true

Note: Dropping the ivy jar into the libpath above didn't seem to make a difference.

It's likely that the workflow needs an extra flag or property, something like this:

<configuration>
   <property>
      <name>oozie.launcher.mapreduce.map.java.opts</name>
      <value>-verbose</value>
   </property>
</configuration>
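
For example, a sketch of what I imagine the launcher configuration could look like, combining the classpath-first property I already tried with -verbose:class so the launcher's stdout would show which jar each class gets loaded from (whether this actually fixes the ordering is exactly my question):

<configuration>
   <property>
      <name>oozie.launcher.mapreduce.user.classpath.first</name>
      <value>true</value>
   </property>
   <property>
      <name>oozie.launcher.mapreduce.map.java.opts</name>
      <!-- -verbose:class prints each loaded class and the jar it came from -->
      <value>-verbose:class</value>
   </property>
</configuration>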

The SRE team that manages the cluster prefers to keep the original jars that ship with CDH 5.9.2.

How can I force spark-submit to use ivy 2.4 (and not 2.0) by changing the workflow.xml, the job properties, my build, or something else, in a way that satisfies the SRE requirement to keep CDH intact? Can I solve this by invalidating the cache?

Please be aware that an answer suggesting adding the ivy 2.4.0 jar to a classpath needs to include details of exactly where to put the jar on HDFS, how that path gets onto the classpath, and so on.
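
For reference, these are the kinds of checks I can run to see which ivy jars are visible on HDFS (the paths assume the default Oozie sharelib location; lib_<timestamp> is a placeholder):

# ivy jars in my libpath
hdfs dfs -ls /user/spark/share/XXXX-spark/ | grep -i ivy

# ivy jars in the Oozie spark sharelib (find the current lib_<timestamp> dir first)
hdfs dfs -ls /user/oozie/share/lib/
hdfs dfs -ls /user/oozie/share/lib/lib_<timestamp>/spark/ | grep -i ivy

# ask the Oozie server which jars it thinks are in the spark sharelib (assumes OOZIE_URL is set)
oozie admin -shareliblist spark | grep -i ivy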

codeaperature
    Do you run your Spark job in `local` mode (i.e. only in the Oozie launcher container) or in `yarn-client` mode? If `yarn-client`, does the exception occur in the driver or in the executors (which do not inherit Oozie libpath nor `oozie.launcher` props)? – Samson Scharfrichter Nov 02 '17 at 18:14
  • Similar to https://stackoverflow.com/questions/42689304/spark-job-that-use-hive-context-failing-in-oozie – Samson Scharfrichter Nov 02 '17 at 18:21
  • 1st: This issue is similar to community.cloudera.com/t5/Batch-Processing-and-Workflow/…. 2nd @SamsonScharfrichter ... The issue occurs when running in yarn-cluster mode with the default client mode, and it shows in the workflow log. I'm not sure if I am answering the 2nd part of your question. I tried the suggestions in the link but still didn't get past run #17 - a typical fail. – codeaperature Nov 03 '17 at 17:53
  • If there was always a conflict between, say, two JARs for V2.0 and one for V2.4, with random placement in the CLASSPATH, then your job would fail 66% of the time for each executor -- i.e. > 99% of the time if you have many executors. The fact that failure is _rare_ hints that the issue occurs only on **specific nodes** that have a slightly different config. So, you have to track the YARN logs for the Spark driver and Spark executors and do some stats *per node running a container* -- that's how you can prove to your SREs that it's all their fault and they have some reconfig to do. – Samson Scharfrichter Nov 04 '17 at 14:40
  • And if indeed the problem is specific to the Oozie launcher, then it's even easier to get the exact `application_0000_000000` YARN ID. That one shows in Oozie; whereas the executors relate to a different ID (spawned by the driver, out of control of Oozie) – Samson Scharfrichter Nov 04 '17 at 14:43
  • @SamsonScharfrichter I have this exact same issue and it happens on the executors and NOT in the driver. – Mike Pone Jan 03 '18 at 23:23

1 Answer


Cloudera's Spark (see the pom at https://github.com/cloudera/spark/blob/cdh5-1.6.0_5.9.2/pom.xml) uses Ivy 2.4.0, but the CDH distribution ships with Ivy 2.0.0-rc2.

To solve this issue, in the HDFS folder /user/oozie/share/lib/lib_{timestamp}/spark, the ivy 2.0.0-rc2 jar was replaced with the 2.4.0 version (which is oddly named org.apache.ivy_ivy-2.4.0.jar, but I don't think that matters). After replacing the jar and running an Oozie admin action (oozie admin -sharelibupdate, to flush/rescan the sharelib folder), the launches worked fine when the workflow was started thereafter.
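
Roughly, the steps were along these lines (lib_{timestamp} is a placeholder for the actual sharelib directory, the exact name of the 2.0.0-rc2 jar on your cluster may differ, and ivy-2.4.0.jar is assumed to have already been downloaded locally, e.g. from Maven Central):

# find the current sharelib directory (lib_{timestamp} is a placeholder)
hdfs dfs -ls /user/oozie/share/lib/

# swap the ivy jars inside the spark sharelib (needs write access, e.g. the oozie user)
hdfs dfs -rm /user/oozie/share/lib/lib_{timestamp}/spark/ivy-2.0.0-rc2.jar
hdfs dfs -put ivy-2.4.0.jar /user/oozie/share/lib/lib_{timestamp}/spark/org.apache.ivy_ivy-2.4.0.jar

# tell the Oozie server to rescan the sharelib, then verify
oozie admin -sharelibupdate
oozie admin -shareliblist spark | grep -i ivy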

Along the lines of Samson's comments, the ivy cache varied across some nodes because new nodes had been added at a later time, which caused the infrequent/intermittent failures.

codeaperature