
I am trying to convert a Java DataFrame to a PySpark DataFrame. To do this, I create a DataFrame (or Dataset of Row) in a Java process and start a py4j.GatewayServer in that process. On the Python side I create a py4j.java_gateway.JavaGateway() client object and pass it to PySpark's SparkContext constructor to link it to the JVM process that is already running. But I am getting this error:

File: "path_to_virtual_environment/lib/site-packages/pyspark/conf.py", line 120, in __init__
    self._jconf = _jvm.SparkConf(loadDefaults)
TypeError: 'JavaPackage' object is not callable

Can someone please help? Below is the code I am using:

Java code:-

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import py4j.GatewayServer;

public class TestJavaToPythonTransfer {
    Dataset<Row> df1;

    public TestJavaToPythonTransfer() {
        SparkSession spark = SparkSession.builder()
                .appName("test1")
                .config("spark.master", "local")
                .getOrCreate();
        df1 = spark.read().json("path/to/local/json_file");
    }

    public Dataset<Row> getDf() {
        return df1;
    }

    public static void main(String[] args) {
        GatewayServer gatewayServer = new GatewayServer(new TestJavaToPythonTransfer());
        gatewayServer.start();
        System.out.println("Gateway server started");
    }
}

Python code:-

from pyspark.sql import SQLContext, DataFrame
from pyspark import SparkContext, SparkConf
from py4j.java_gateway import JavaGateway

gateway = JavaGateway()
conf = SparkConf().set('spark.io.encryption.enabled', 'true')
py_sc = SparkContext(gateway=gateway, conf=conf)
j_df = gateway.getDf()
py_df = DataFrame(j_df, SQLContext(py_sc))
print('print dataframe content')
print(py_df.collect())

Command to run the Python code:-

python path_to_python_file.py

I also tried doing this:-

$SPARK_HOME/bin/spark-submit --master local path_to_python_file.py

Here the code does not throw any error, but it does not print anything to the terminal either. Do I need to set some Spark conf for this?

P.S. - apologies in advance if there are typos in the code or the error stack, since I could not copy them directly from my firm's IDE.

Aditya

1 Answer


There is a missing call to entry_point before calling getDf().

So, try this:

app = gateway.entry_point
j_df = app.getDf()
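
In py4j, gateway.entry_point returns whatever object was passed to the GatewayServer constructor on the Java side, so methods must be called on it rather than on the gateway itself. A minimal sketch of the correspondence (reusing the class name from the question; it assumes the Java GatewayServer is already running):

from py4j.java_gateway import JavaGateway

# Java side (from the question):
#   GatewayServer gatewayServer = new GatewayServer(new TestJavaToPythonTransfer());

gateway = JavaGateway()      # connects to the already-running GatewayServer
app = gateway.entry_point    # the TestJavaToPythonTransfer instance
j_df = app.getDf()           # call methods on the entry point, not on the gateway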

Additionally, I have created a working copy below using Python and Scala (hope you don't mind), which shows how, on the Scala side, a py4j gateway is started with a Spark session and a sample DataFrame, and how, on the Python side, that DataFrame is accessed and converted to a Python list of tuples before being turned back into a DataFrame in a Spark session on the Python side:

Python:

from py4j.java_gateway import JavaGateway
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, IntegerType, StructField

if __name__ == '__main__':
    gateway = JavaGateway()

    spark_app = gateway.entry_point
    df = spark_app.df()

    # Note "apply" method here comes from Scala's companion object to access elements of an array
    df_to_list_tuple = [(int(i.apply(0)), int(i.apply(1))) for i in df]

    spark = (SparkSession
             .builder
             .appName("My PySpark App")
             .getOrCreate())

    schema = StructType([
        StructField("a", IntegerType(), True),
        StructField("b", IntegerType(), True)])

    df = spark.createDataFrame(df_to_list_tuple, schema)

    df.show()
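
Note that df on the Python side is the collected Array[Row] from the Scala object below, so each element and field access crosses the py4j bridge as a separate call; that is fine for a small sample like this, but it would be slow for large data.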

Scala:

import java.nio.file.{Path, Paths}

import org.apache.spark.sql.SparkSession
import py4j.GatewayServer

object SparkApp {
  val myFile: Path = Paths.get(System.getProperty("user.home") + "/dev/sample_data/games.csv")

  val spark = SparkSession.builder()
    .master("local[*]")
    .appName("My app")
    .getOrCreate()

  // Collected to an Array[Row] so the Python side can iterate over it via py4j
  val df = spark
    .read
    .option("header", "True")
    .csv(myFile.toString)
    .collect()
}

object Py4JServerApp extends App {
  // Expose SparkApp as the gateway's entry point
  val server = new GatewayServer(SparkApp)
  server.start()

  println("Started and running...")
}
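
To run it, start Py4JServerApp first so the JVM and its GatewayServer are listening (py4j defaults to port 25333), then launch the Python script with a plain python interpreter; JavaGateway() connects to the already-running server instead of starting its own JVM.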

Khalid Mammadov