There are some subtleties in the way Python and the JVM communicate in PySpark. The bridge works with Java objects, i.e., `JavaRDD` and not `RDD`, and those need explicit unboxing in Scala. Since your Scala function takes an `RDD`, you need to write a wrapper in Scala that receives a `JavaRDD` and performs the unboxing first:
import org.apache.spark.api.java.JavaRDD

def runFunctionWrapper[T](jrdd: JavaRDD[T], ...) = {
  runFunction(jrdd.rdd, ...)  // .rdd unboxes the JavaRDD into the underlying Scala RDD
}
Then call it from PySpark like this:

spark._jvm.mylibrary.runFunctionWrapper(myPythonRdd._jrdd, ...)
Note that, by Python convention, `_jrdd` is considered a private member of the Python `RDD` class, so this effectively relies on an undocumented implementation detail. The same applies to the `_jvm` member of `SparkContext`.
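
For illustration, a minimal Python-side sketch of the round trip could look like the following. It assumes the wrapper is reachable on the JVM as `mylibrary.runFunctionWrapper`, takes no arguments beyond the RDD, and returns a `JavaRDD`; the input data and the re-wrapping step are placeholders, and whether the default pickle-based deserialiser works depends entirely on what the Scala side actually produces:

```python
from pyspark.rdd import RDD
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

myPythonRdd = sc.parallelize(["a", "b", "c"])  # placeholder input data

# _jvm and _jrdd are private, undocumented members -- see the caveat above
result_jrdd = spark._jvm.mylibrary.runFunctionWrapper(myPythonRdd._jrdd)

# If the wrapper returns a JavaRDD, it can be wrapped back into a Python RDD;
# the default (pickle-based) deserialiser only works if the Scala side
# serialises its elements accordingly.
result = RDD(result_jrdd, sc)
```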
The real problem is making Scala call back into Python for the application of `function`. In PySpark, the Python RDD's `map()` method creates an instance of `org.apache.spark.api.python.PythonFunction`, which holds a pickled reference to the Python mapper function together with its environment. Each RDD partition then gets serialised and sent, together with the pickled function, over TCP to a Python process co-located with the Spark executor, where the partition is deserialised and iterated over. Finally, the result gets serialised again and sent back to the executor. The whole process is orchestrated by an instance of `org.apache.spark.api.python.PythonRunner`. This is very different from building a wrapper around the Python function and passing it to the `map()` method of the `RDD` instance.
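
You can get a concrete sense of that boundary from the Python side with a small illustrative snippet; the printed representations below are examples and may differ between Spark versions:

```python
from pyspark.sql import SparkSession

sc = SparkSession.builder.getOrCreate().sparkContext

rdd = sc.parallelize(range(4))
mapped = rdd.map(lambda x: x * 2)  # the lambda never becomes a JVM function

# The JVM object backing `mapped` carries the pickled lambda; from Python it
# is only visible as an opaque py4j proxy.
print(type(mapped._jrdd))  # <class 'py4j.java_gateway.JavaObject'>
print(mapped._jrdd)        # e.g. PythonRDD[1] at RDD at PythonRDD.scala:...
print(mapped.collect())    # [0, 2, 4, 6] -- computed in Python worker processes
```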
I believe it is best if you simply replicate the functionality of `runFunction` in Python or (much better performance-wise) replicate the functionality of `myPythonFun` in Scala. Or, if what you do can be done interactively, follow the suggestion of @EnzoBnl and make use of a polyglot notebook environment like Zeppelin or Polynote.