
I am looking for some help or example code that illustrates PySpark calling user-written Java code (outside of Spark itself) that takes a SparkContext from Python and then returns an RDD built in Java.

For completeness, I'm using Py4J 0.8.1, Java 8, Python 2.7, and Spark 1.3.1.

Here is what I am using for the Python half:

import pyspark

# Create the SparkContext on the Python side
sc = pyspark.SparkContext(master='local[4]',
                          appName='HelloWorld')

print "version", sc._jsc.version()

# Connect to the separately started Java gateway server
from py4j.java_gateway import JavaGateway
gateway = JavaGateway()

# Hand the wrapped JavaSparkContext to the Java entry point
print gateway.entry_point.getRDDFromSC(sc._jsc)

The Java portion is:

import java.util.Map;
import java.util.List;
import java.util.ArrayList;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.Function2;

import py4j.GatewayServer;

public class HelloWorld 
{
   // Called from Python through the Py4J gateway's entry point
   public JavaRDD<Integer> getRDDFromSC(JavaSparkContext jsc)
   {
      JavaRDD<Integer> result = null;
      if (jsc == null)
      {
         System.out.println("XXX Bad mojo XXX");

         return result;
      }

      int n = 10;
      List<Integer> l = new ArrayList<Integer>(n);
      for (int i = 0; i < n; i++) 
      {
         l.add(i);
      }

      result = jsc.parallelize(l);

      return result;
   }

   public static void main(String[] args)
   {
      // Start a standalone Py4J gateway with this instance as the entry point
      HelloWorld app = new HelloWorld();
      GatewayServer server = new GatewayServer(app);
      server.start();
   }
}

Running this produces the following on the Python side:

$ spark-1.3.1-bin-hadoop1/bin/spark-submit main.py
version 1.3.1
sc._jsc <class 'py4j.java_gateway.JavaObject'>
org.apache.spark.api.java.JavaSparkContext@50418105
None

The Java side reports:

$ spark-1.3.1-bin-hadoop1/bin/spark-submit --class "HelloWorld" --master local[4] target/hello-world-1.0.jar
XXX Bad mojo XXX

The problem appears to be that I am not correctly passing the JavaSparkContext from Python to Java. The same failure (the returned JavaRDD is null) occurs when I pass sc._jsc.sc() from Python instead.

What is the correct way to invoke user-defined Java code that uses Spark from Python?

1 Answer

So I've got an example of this in a branch that I'm working on for Sparkling Pandas. The branch lives at https://github.com/holdenk/sparklingpandas/tree/add-kurtosis-support and the PR is at https://github.com/sparklingpandas/sparklingpandas/pull/90 .

As it stands, it looks like you have two different gateway servers, which seems like it could cause problems. Instead, you can just use the existing gateway server and do something like:

sc._jvm.what.ever.your.class.package.is.HelloWorld.getRDDFromSC(sc._jsc)

assuming you make that a static method as well.
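
Roughly, the Java side would then look something like this minimal sketch (class name and jar taken from the question; getting the jar onto the driver's classpath, e.g. via spark-submit's --driver-class-path or --jars, depends on your setup):

import java.util.ArrayList;
import java.util.List;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class HelloWorld
{
   // Static so it can be reached as sc._jvm.HelloWorld.getRDDFromSC(...)
   // through Spark's own Py4J gateway; no separate GatewayServer needed.
   public static JavaRDD<Integer> getRDDFromSC(JavaSparkContext jsc)
   {
      List<Integer> l = new ArrayList<Integer>();
      for (int i = 0; i < 10; i++)
      {
         l.add(i);
      }
      return jsc.parallelize(l);
   }
}

Since the call goes through sc._jvm, it runs inside the same JVM that already holds your JavaSparkContext, so there is no second gateway and no mismatch between the context created by spark-submit and the one your Java code sees.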

Holden