This is for a PySpark / Databricks project:
I've written a Scala JAR library and exposed its functions as UDFs via a simple Python wrapper; everything works as it should in my PySpark notebooks. However, when I try to use any of the functions imported from the JAR inside an sc.parallelize(..).foreach(..) call, execution keeps dying with the following error:
TypeError: 'JavaPackage' object is not callable
at this line in the wrapper:
jc = get_spark()._jvm.com.company.package.class.get_udf(function.__name__)
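To give a bit more context, the wrapper is essentially a thin lookup along these lines (this is a simplified sketch: the real package/class names differ, and java_udf / MyUdfs are placeholders):

    from pyspark.sql import SparkSession
    from pyspark.sql.column import Column, _to_java_column, _to_seq

    def get_spark():
        return SparkSession.builder.getOrCreate()

    def java_udf(function):
        """Wraps the Scala UDF registered in the JAR under the Python function's name."""
        def wrapper(*cols):
            spark = get_spark()
            # This is the lookup that blows up with "'JavaPackage' object is not callable"
            jc = spark._jvm.com.company.package.MyUdfs.get_udf(function.__name__)
            return Column(jc.apply(_to_seq(spark.sparkContext, cols, _to_java_column)))
        return wrapper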
My suspicion is that the JAR library is not available in the parallelized context, since if I replace the library path with some gibberish, the error stays exactly the same.
I haven't been able to find the necessary clues in the Spark docs so far. Adding the JAR with sc.addFile("dbfs:/FileStore/path-to-library.jar") didn't help either.
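For completeness, here is a stripped-down version of what I'm running; my_wrapper and clean_text are placeholders for the actual wrapper module and one of the JAR-backed functions it exposes:

    from pyspark import SparkContext
    # Placeholder import: stands in for the real wrapper module and function
    from my_wrapper import clean_text

    sc = SparkContext.getOrCreate()
    sc.addFile("dbfs:/FileStore/path-to-library.jar")  # the attempt mentioned above

    def process(value):
        # This call reaches the wrapper's get_udf lookup and dies on the executor with
        # "TypeError: 'JavaPackage' object is not callable"
        clean_text(value)

    sc.parallelize(["a", "b", "c"]).foreach(process)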