This is for a PySpark / Databricks project:

I've written a Scala JAR library and exposed its functions as UDFs via a simple Python wrapper; everything works as it should in my PySpark notebooks. However, when I try to use any of the functions imported from the JAR in an sc.parallelize(..).foreach(..) environment, execution keeps dying with the following error:

TypeError: 'JavaPackage' object is not callable

at this line in the wrapper:

jc = get_spark()._jvm.com.company.package.class.get_udf(function.__name__)

My suspicion is that the JAR library is not available in the parallelized context, since if I replace the library path with some gibberish, the error remains exactly the same.
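
Roughly, the failing pattern looks like this (a minimal sketch; the class name MyUdfProvider and the function name are placeholders standing in for the sanitized names in the wrapper line above):

from pyspark.sql import SparkSession

def get_spark():
    # the wrapper grabs the active session like this
    return SparkSession.builder.getOrCreate()

def wrapped(value):
    # On the driver this lookup resolves to the Scala class from the JAR;
    # inside a parallelized task it fails at this line with
    # TypeError: 'JavaPackage' object is not callable
    jc = get_spark()._jvm.com.company.package.MyUdfProvider.get_udf("my_fn")
    return value

sc = get_spark().sparkContext
sc.parallelize(range(10)).foreach(wrapped)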

I haven't been able to find the necessary clues in the Spark docs so far. Using sc.addFile("dbfs:/FileStore/path-to-library.jar") didn't help.


1 Answer


You could try adding the JAR via the PYSPARK_SUBMIT_ARGS environment variable (before Spark 2.3 this was also possible with SPARK_CLASSPATH).

For example with:

os.environ['PYSPARK_SUBMIT_ARGS'] = '--jars <path/to/jar> pyspark-shell'
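
Note that this only takes effect if the variable is set before the SparkSession (and the underlying JVM) is started. A minimal sketch, with a placeholder JAR path:

import os
from pyspark.sql import SparkSession

# must be set before the SparkSession / SparkContext is created,
# otherwise it is ignored; the JAR path is a placeholder
os.environ['PYSPARK_SUBMIT_ARGS'] = '--jars /path/to/library.jar pyspark-shell'

spark = SparkSession.builder.getOrCreate()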
