
My use case is pretty simple: I want to override a few classes that are part of the Hadoop distribution. To do so, I created a new jar that I ship from the driver to the worker nodes using the spark.jars property.

To make sure my new jar takes precedence on the workers' classpath, I want to add it to the spark.executor.extraClassPath property.

However, since I'm shipping these jars with spark.jars, their path on the workers is dynamic and includes the app-id & executor-id: <some-work-dir>/<app-id>/<executor-id>.

Is there a way around this? Is it possible to add a directory inside the app dir so that it comes first on the classpath?

Working with Spark 2.4.5, standalone client mode, in Docker.

P.S. I'm aware of the option to bake the jar into the worker image and add it to the classpath there, but then I'd have to rebuild the image with every code change.
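
For reference, a minimal sketch of the kind of submission described above; the master URL, paths, and jar names are placeholders, not from the question:

# The jar is shipped to every executor, but it lands under
# <some-work-dir>/<app-id>/<executor-id>/, so there is no stable
# absolute path to reference from spark.executor.extraClassPath.
spark-submit \
  --master spark://spark-master:7077 \
  --deploy-mode client \
  --jars /opt/app/hadoop-overrides.jar \
  my-app.jar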

LiranBo

1 Answer


You can enable this option on spark-submit:

spark.driver.userClassPathFirst=true

See the spark-submit options documentation.
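
For example (the master URL and jar names below are placeholders):

# Ship the override jar and ask the driver to prefer user-added
# jars over Spark's own jars when loading classes.
spark-submit \
  --master spark://spark-master:7077 \
  --conf spark.driver.userClassPathFirst=true \
  --jars /opt/app/hadoop-overrides.jar \
  my-app.jar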

diogoramos
  • I think you meant spark.executor.userClassPathFirst, and the docs are not so clear about it: "user-added jars precedence over Spark's own jars when loading classes..." Does "user-added jars" mean the spark.jars files? – LiranBo Aug 11 '20 at 12:05
  • Exactly: the jars you pass to a Spark job (using --jars) take precedence over Spark's own jars. Note that this feature is still experimental, but if it fixes your problem you can use it. – diogoramos Aug 11 '20 at 14:07
  • The only downside ATM is that it applies to everything, not a single jar. But that's a start :) – LiranBo Aug 11 '20 at 14:20
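
Putting the comment thread together, a sketch of the executor-side variant that was actually needed here (paths are placeholders; Spark documents both flags as experimental):

# Same idea on the executor side: classes from --jars are loaded
# before Spark's own jars, which avoids hard-coding the dynamic
# <app-id>/<executor-id> work dir in spark.executor.extraClassPath.
spark-submit \
  --master spark://spark-master:7077 \
  --conf spark.executor.userClassPathFirst=true \
  --jars /opt/app/hadoop-overrides.jar \
  my-app.jar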