I'm building a Spark application and I want to generate two separate shaded .jar files, one for each of these contexts:
- For `master=local` mode, I want a single .jar file that can be executed with `java -jar shaded-for-local-mode.jar`. This should include all of the dependencies, including the Spark and Hadoop dependencies in use.
- For distributed mode, I want a single .jar file that excludes the Spark and Hadoop libraries (`org.apache.spark:*` and `org.apache.hadoop:*`) so that they do not conflict with the runtime environment provided by `spark-submit` (spark docs), but I also want to include one dependency from the `org.apache.hadoop` group (`org.apache.hadoop:hadoop-aws`) because it isn't provided by the runtime environment (an example invocation follows this list).
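For concreteness, I expect to launch the distributed jar with something like this, where `com.example.MyApp` and the jar name are placeholders for my actual main class and artifact:

spark-submit --master yarn --class com.example.MyApp target/myapp-1.0-shade-spark-submit.jar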
This answer explained how to create two jars by using two separate `<execution>` blocks, but I'm having trouble getting the excludes to work the way I want. Here's the relevant shade `<execution>`:
<execution>
  <id>shade-spark-submit</id>
  <phase>package</phase>
  <goals>
    <goal>shade</goal>
  </goals>
  <configuration>
    <shadedClassifierName>shaded-spark-submit</shadedClassifierName>
    <artifactSet>
      <excludes>
        <exclude>org.apache.spark:*</exclude>
        <!-- We want hadoop-aws, an explicit dependency of this project,
             but not any of the other hadoop artifacts. -->
        <exclude>org.apache.hadoop:hadoop-auth</exclude>
        <exclude>org.apache.hadoop:hadoop-common</exclude>
        <exclude>org.apache.hadoop:hadoop-annotations</exclude>
      </excludes>
    </artifactSet>
    <finalName>${project.artifactId}-${project.version}-shade-spark-submit</finalName>
  </configuration>
</execution>
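For comparison, the local-mode `<execution>` is the same pattern with no `<artifactSet>` at all; roughly this (the id, finalName, and main class are just the naming I'm using):

<execution>
  <id>shade-local-mode</id>
  <phase>package</phase>
  <goals>
    <goal>shade</goal>
  </goals>
  <configuration>
    <!-- No artifactSet: bundle everything, including the Spark and Hadoop dependencies. -->
    <transformers>
      <!-- Set Main-Class in the manifest so the jar runs with java -jar. -->
      <transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
        <mainClass>com.example.MyApp</mainClass>
      </transformer>
    </transformers>
    <finalName>${project.artifactId}-${project.version}-shade-local-mode</finalName>
  </configuration>
</execution>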
My concern is that these `<exclude>` statements explicitly exclude specific artifacts that I happen to know about, which is subtly different from my goal: I want to always exclude `org.apache.hadoop:*` (even artifacts I don't know about) but include `org.apache.hadoop:hadoop-aws`. The maven-shade-plugin documentation does not fully describe how `<include>` and `<exclude>` tags are processed together.
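Ideally I could express the intent directly with something like the sketch below, but I don't know whether the plugin treats an `<include>` as an exception to a wildcard `<exclude>`, or treats `<includes>` as a whitelist that drops everything not listed:

<artifactSet>
  <excludes>
    <!-- Drop the whole groups provided by the spark-submit runtime... -->
    <exclude>org.apache.spark:*</exclude>
    <exclude>org.apache.hadoop:*</exclude>
  </excludes>
  <includes>
    <!-- ...but keep this one artifact, which the runtime does not provide. -->
    <include>org.apache.hadoop:hadoop-aws</include>
  </includes>
</artifactSet>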
Thanks!