
I would like to use my own Spark jars in my application. More concretely, I have an mllib jar that is not yet released and contains a fix for a BisectingKMeans bug. My idea is to use it in my Spark cluster (locally it works perfectly).

I've tried many things: extraClassPath, userClassPathFirst, the --jars option... many options that do not work. My last idea is to use sbt-assembly's shade rules to rename all org.apache.spark.* packages to shadespark.*, but when I deploy, the application still uses the cluster's Spark jars.
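For reference, the shade rule I mean looks roughly like this in build.sbt (a minimal sketch assuming the sbt-assembly plugin is installed; the shadespark prefix is just a name I chose):

    // build.sbt -- requires the sbt-assembly plugin in project/plugins.sbt
    // Rename every org.apache.spark.* class bundled into the fat jar so it
    // cannot collide with the Spark classes already on the cluster classpath.
    assemblyShadeRules in assembly := Seq(
      ShadeRule.rename("org.apache.spark.**" -> "shadespark.@1").inAll
    )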

Any idea?

Gorka

1 Answer


You can try the Maven Shade Plugin to relocate the conflicting packages. This creates a separate namespace for the newer version of the mllib jar: both the old and the new version will be on the classpath, but since the new version has an alternative package name, you can refer to it explicitly.

Have a look at https://maven.apache.org/plugins/maven-shade-plugin/examples/class-relocation.html:

If the uber JAR is reused as a dependency of some other project, directly including classes from the artifact's dependencies in the uber JAR can cause class loading conflicts due to duplicate classes on the class path. To address this issue, one can relocate the classes which get included in the shaded artifact in order to create a private copy of their bytecode:
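A relocation for this case could look something like the following pom.xml fragment (a sketch along the lines of the example on that page; the shadespark prefix and the mllib-only pattern are illustrative):

    <!-- pom.xml: relocate the bundled mllib classes into a private namespace -->
    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-shade-plugin</artifactId>
      <executions>
        <execution>
          <phase>package</phase>
          <goals>
            <goal>shade</goal>
          </goals>
          <configuration>
            <relocations>
              <relocation>
                <pattern>org.apache.spark.mllib</pattern>
                <shadedPattern>shadespark.mllib</shadedPattern>
              </relocation>
            </relocations>
          </configuration>
        </execution>
      </executions>
    </plugin>

After shading, your application code refers to the relocated classes (e.g. shadespark.mllib.clustering.BisectingKMeans) instead of the originals.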

I got this idea from the video "Top 5 Mistakes When Writing Spark Applications": https://youtu.be/WyfHUNnMutg?t=23m1s

rdeboo
  • In what part of that video does Grover or Malaska talk about the shading plugin? – eliasah Feb 23 '17 at 07:54
  • I tried this solution and it seemed to work: every package was renamed, but Spark still picked up the cluster's jars. A provisional solution was to replace Spark's mllib jar (in the jars folder) with a newer one, and that worked. – Gorka Mar 21 '17 at 13:28