I am trying to convert a Java Spark DataFrame to a PySpark DataFrame. To do this, I create a DataFrame (a Dataset<Row>) in a Java process and start a py4j.GatewayServer on the Java side. On the Python side, I create a py4j.java_gateway.JavaGateway() client and pass it to PySpark's SparkContext constructor to link it to the JVM process that is already running. However, I get this error:
File: "path_to_virtual_environment/lib/site-packages/pyspark/conf.py", line 120, in __init__
self._jconf = _jvm.SparkConf(loadDefaults)
TypeError: 'JavaPackage' object is not callable
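For context on the error itself: py4j hands back a JavaPackage placeholder when a name looked up on gateway.jvm does not resolve to a class that the JVM view can see, and calling that placeholder raises this TypeError. Whether that is what happens in my setup is an assumption, but a minimal diagnostic sketch could look like the following. The helper names resolves_to_class and check_spark_classes are hypothetical, not part of py4j or PySpark, and the check compares proxy type names so that it does not need a live JVM just to be imported.

```python
def resolves_to_class(jvm_attr):
    """Return True for a py4j JavaClass proxy, False for anything else
    (for example a JavaPackage placeholder)."""
    # Compare by type name so this module imports without py4j or a JVM.
    return type(jvm_attr).__name__ == "JavaClass"


def check_spark_classes(jvm, names=("SparkConf", "org.apache.spark.SparkConf")):
    """Walk each dotted name on the JVM view and report whether it
    resolved to a class. `jvm` is expected to be something like gateway.jvm."""
    report = {}
    for name in names:
        obj = jvm
        # Attribute access on a JVM view or package yields either a
        # JavaPackage (still unresolved) or a JavaClass (resolved).
        for part in name.split("."):
            obj = getattr(obj, part)
        report[name] = resolves_to_class(obj)
    return report
```

With the gateway object from the Python code in this post, check_spark_classes(gateway.jvm) would return a dict of name to True/False. A False for the fully qualified name would suggest the Spark jars are not visible to the gateway's JVM, while a False only for the short name would suggest the class was never imported into the JVM view.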
Can someone please help? Below is the code I am using.
Java code:
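import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import py4j.GatewayServer;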
public class TestJavaToPythonTransfer {
    Dataset<Row> df1;

    public TestJavaToPythonTransfer() {
        SparkSession spark = SparkSession.builder()
                .appName("test1")
                .config("spark.master", "local")
                .getOrCreate();
        df1 = spark.read().json("path/to/local/json_file");
    }

    public Dataset<Row> getDf() {
        return df1;
    }

    public static void main(String[] args) {
        GatewayServer gatewayServer = new GatewayServer(new TestJavaToPythonTransfer());
        gatewayServer.start();
        System.out.println("Gateway server started");
    }
}
Python code:
from pyspark.sql import SQLContext, DataFrame
from pyspark import SparkContext, SparkConf
from py4j.java_gateway import JavaGateway
gateway = JavaGateway()
conf = SparkConf().set('spark.io.encryption.enabled','true')
py_sc = SparkContext(gateway=gateway,conf=conf)
j_df = gateway.getDf()
py_df = DataFrame(j_df,SQLContext(py_sc))
print('print dataframe content')
print(py_df.collect())
Command to run the Python code:
python path_to_python_file.py
I also tried this:
$SPARK_HOME/bin/spark-submit --master local path_to_python_file.py
With this, the code does not throw any error, but it also prints nothing to the terminal. Do I need to set some Spark configuration for this?
P.S. Apologies in advance for any typos or mistakes in the code; I could not copy the code and error stack directly from my firm's IDE.