
I have a Scala library which (to put it simply) receives a function, applies it to an RDD and returns another RDD:

import org.apache.spark.rdd.RDD

def runFunction(rdd: RDD[Any], function: Any => Any): RDD[Any] = {
    // ...
    val res = rdd.map(function)
    // ...
    res
}

In Scala the usage would be:

import mylibrary.runFunction
runFunction(myRdd, myScalaFun)

This library is packaged in a JAR and I now want to use it from Python too. What I would like to do is load this library in Python and pass a Python function to it. The usage in Python would be:

spark._jvm.mylibrary.runFunction(myPythonRdd, myPythonFun)

This would allow me to use Python functions as well as Scala ones without the need to port the whole library to Python. Is this something that can be achieved with Spark's capability of going back and forth between Python and the JVM?

alexlipa
  • In my opinion, all this can quickly become hard to maintain... If the only reason you are not using pure Scala is that you need/like the Python ecosystem (for dataviz, ML, ...), I suggest you have a look at [Netflix's polynote](https://polynote.org/), which allows you to seamlessly mix both languages in one notebook, with good Spark support. – bonnal-enzo Nov 14 '19 at 11:49

1 Answer


There are some subtleties in the way Python and the JVM communicate in PySpark. The bridge uses Java objects, i.e., JavaRDD and not RDD, and those need explicit unboxing in Scala. Since your Scala function takes an RDD, you need to write a wrapper in Scala that receives a JavaRDD and performs the unboxing first:

import org.apache.spark.api.java.JavaRDD

def runFunctionWrapper(jrdd: JavaRDD[Any], ...) = {
  // unbox the JavaRDD into the underlying Scala RDD before calling the library
  runFunction(jrdd.rdd, ...)
}

Then call it from Python like this:

spark._jvm.mylibrary.runFunctionWrapper(myPythonRdd._jrdd, ...)

Note that, by Python convention, _jrdd is considered a private member of the Python RDD class, so this effectively relies on an undocumented implementation detail. The same applies to the _jvm member of SparkContext.
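Note also that runFunction returns a Scala RDD; if the Python side needs to hold on to the result, the wrapper would also have to box it back into a JavaRDD before returning it, e.g. with JavaRDD.fromRDD. Here is a minimal sketch of such a round-trip wrapper (the name runFunctionRoundTrip is just an illustration, and it assumes the runFunction signature from the question is in scope):

import org.apache.spark.api.java.JavaRDD

def runFunctionRoundTrip(jrdd: JavaRDD[Any], function: Any => Any): JavaRDD[Any] = {
  // unbox the JavaRDD, run the Scala library code, then box the result again for Py4J
  val result = runFunction(jrdd.rdd, function)
  JavaRDD.fromRDD(result)
}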

The real problem is making Scala call back into Python to apply function. In PySpark, the Python RDD's map() method creates an instance of org.apache.spark.api.python.PythonFunction, which holds a pickled reference to the Python mapper function together with its environment. Each RDD partition then gets serialised and, together with the pickled function, sent over TCP to a Python process co-located with the Spark executor, where the partition is deserialised and iterated over. Finally, the result gets serialised again and sent back to the executor. The whole process is orchestrated by an instance of org.apache.spark.api.python.PythonRunner. This is very different from building a wrapper around the Python function and passing it to the map() method of the RDD instance.

I believe it is best if you simply replicate the functionality of runFunction in Python or (much better performance-wise) replicate the functionality of myPythonFun in Scala. Or, if what you are doing can be done interactively, follow the suggestion of @bonnal-enzo and use a polyglot notebook environment like Zeppelin or Polynote.
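As a rough illustration of the second option (all names below are hypothetical, the mapper logic is a placeholder, and it assumes runFunction from the question is reachable from this scope), you could re-implement the mapper in Scala next to the library and expose an entry point that applies it, so that only the RDD has to cross the Py4J bridge:

package mylibrary

import org.apache.spark.api.java.JavaRDD

object PythonEntryPoint {
  // Scala re-implementation of the logic of myPythonFun (placeholder)
  val myScalaFun: Any => Any = {
    case s: String => s.toUpperCase
    case other     => other
  }

  // From Python: spark._jvm.mylibrary.PythonEntryPoint.applyMyFun(myPythonRdd._jrdd)
  def applyMyFun(jrdd: JavaRDD[Any]): JavaRDD[Any] =
    JavaRDD.fromRDD(runFunction(jrdd.rdd, myScalaFun))
}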

Hristo Iliev