This is for a PySpark / Databricks project:

I've written a Scala JAR library and exposed its functions as UDFs via a simple Python wrapper; everything works as it should in my PySpark notebooks. However, when I try to use any of the functions imported from the JAR in an sc.parallelize(..).foreach(..) environment, execution keeps dying with the following error:

TypeError: 'JavaPackage' object is not callable

at this line in the wrapper:

jc = get_spark()._jvm.com.company.package.class.get_udf(function.__name__)

My suspicion is that the JAR library is not available in the parallelized context, since if I replace the library path with some gibberish, the error remains exactly the same.
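
Roughly, the failing pattern looks like this (a minimal sketch; the class name MyUdfProvider and the function name are placeholders standing in for the sanitized names in the wrapper line above):

from pyspark.sql import SparkSession

def get_spark():
    # the wrapper grabs the active session like this
    return SparkSession.builder.getOrCreate()

def wrapped(value):
    # On the driver this lookup resolves to the Scala class from the JAR;
    # inside a parallelized task it fails at this line with
    # TypeError: 'JavaPackage' object is not callable
    jc = get_spark()._jvm.com.company.package.MyUdfProvider.get_udf("my_fn")
    return value

sc = get_spark().sparkContext
sc.parallelize(range(10)).foreach(wrapped)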

I haven't been able to find the necessary clues in the Spark docs so far. Using sc.addFile("dbfs:/FileStore/path-to-library.jar") didn't help.


1 Answer


You could try adding the JAR via the PYSPARK_SUBMIT_ARGS environment variable (before Spark 2.3 this was also possible with SPARK_CLASSPATH).

For example with:

os.environ['PYSPARK_SUBMIT_ARGS'] = '--jars <path/to/jar> pyspark-shell'
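
Note that this only takes effect if the variable is set before the SparkSession (and the underlying JVM) is started. A minimal sketch, with a placeholder JAR path:

import os
from pyspark.sql import SparkSession

# must be set before the SparkSession / SparkContext is created,
# otherwise it is ignored; the JAR path is a placeholder
os.environ['PYSPARK_SUBMIT_ARGS'] = '--jars /path/to/library.jar pyspark-shell'

spark = SparkSession.builder.getOrCreate()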
