
I was trying to run a Java function in PySpark using Py4J. Py4J makes it possible to access Java objects running in a JVM from Python. I started a separate JVM instance and was able to run the Java function successfully.

Py4J enables this communication through a GatewayServer instance.

I was wondering if I could somehow access Spark's internal JVM to run my Java function. What is the entry point for the Py4J GatewayServer in Spark, and how can I add my function to that entry point?
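For reference, this is roughly what I did with the separate JVM (the entry-point class and method names here are just placeholders for my actual Java code):

from py4j.java_gateway import JavaGateway

# Connects to a GatewayServer already started in a separate JVM, e.g. via
# new GatewayServer(new MyEntryPoint()).start() on the Java side.
gateway = JavaGateway()

# Call the Java function through the gateway's entry-point object
# (myFunction is a placeholder for my actual method).
result = gateway.entry_point.myFunction("some input")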

Sanjay Kumar
    Could you [edit the question](https://stackoverflow.com/posts/35774347/edit) to provide some background? How do you want to use it? Py4J has very limited scope in Spark. – zero323 Mar 03 '16 at 18:39

2 Answers


I am not sure if this is what you need, but there are two places I have seen:

sc._gateway.jvm

which can be used for java_import or directly

sc._jvm

So to access class X in package a.b.c you can do one of the following:

from py4j.java_gateway import java_import

jvm = sc._gateway.jvm
java_import(jvm, "a.b.c.X")
instance = jvm.X()

or more directly:

instance = sc._jvm.a.b.c.X()
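For example, you can sanity-check that the driver-side gateway is working using classes that are always on the classpath:

# Standard JDK classes are reachable the same way, so this should work in any
# pyspark shell and confirms the gateway is wired up:
millis = sc._jvm.java.lang.System.currentTimeMillis()
rng = sc._jvm.java.util.Random()
print(millis, rng.nextInt(10))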

To add a Java function of your own, you need to make sure its jar is on the classpath; if you also want to use it on the workers (e.g. in a UDF), the jar has to be shipped to them as well. To achieve that, pass --driver-class-path to spark-submit (or pyspark) to add it to the driver, and --jars to send it to the workers.
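For example, assuming your code is packaged as my-java-lib.jar and exposes a class com.example.MyFunctions (both names are placeholders for whatever you actually build), a rough sketch would be:

# Submit with the jar available on the driver and shipped to the workers:
#   spark-submit --driver-class-path my-java-lib.jar --jars my-java-lib.jar my_app.py

from pyspark import SparkContext

sc = SparkContext(appName="java-function-demo")

# Driver-side call through the gateway; com.example.MyFunctions is hypothetical.
result = sc._jvm.com.example.MyFunctions.transform("hello")
print(result)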

Assaf Mendelson
    How can you access a JVM from the executors? `sc._jvm` gets a JVM, but SparkContext isn't accessible from the executors - and simply creating a `JavaGateway` tries to connect to a (non-running) GatewayServer. Should I be starting a GatewayServer on my executors? – scubbo Aug 12 '19 at 00:21

Look at

$SPARK_HOME/python/pyspark/java_gateway.py

There you will see the mechanisms used to interface with the Java/Scala backend.

You will need to work with one or more of the Java entry-point classes imported there, shown here:

java_import(gateway.jvm, "org.apache.spark.SparkConf")
java_import(gateway.jvm, "org.apache.spark.api.java.*")
java_import(gateway.jvm, "org.apache.spark.api.python.*")
java_import(gateway.jvm, "org.apache.spark.mllib.api.python.*")
# TODO(davies): move into sql
java_import(gateway.jvm, "org.apache.spark.sql.*")
java_import(gateway.jvm, "org.apache.spark.sql.hive.*")
java_import(gateway.jvm, "scala.Tuple2")

These represent the Spark-Java entry points.

PySpark goes through these Java entry points rather than calling Scala directly. You need to either (a) use the existing entry points in those API classes, or (b) add new entry points to those classes and build your own version of Spark.
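As a sketch of option (a): classes already imported by java_gateway.py are reachable by their short names through the gateway, and you can register additional classes at runtime with java_import instead of rebuilding Spark (com.example.MyFunctions is a placeholder, and its jar must already be on the driver classpath):

from py4j.java_gateway import java_import

gw = sc._gateway

# SparkConf was imported by java_gateway.py, so the short name should resolve:
conf = gw.jvm.SparkConf()

# A user-supplied class can be registered the same way at runtime
# (hypothetical class; its jar must be on the driver classpath):
java_import(gw.jvm, "com.example.MyFunctions")
result = gw.jvm.MyFunctions.transform("hello")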

WestCoastProjects