
I have a class in Java that builds a fairly sophisticated Spark DataFrame.

package companyX;

import org.apache.spark.sql.DataFrame; // Spark 1.x; on Spark 2.x+ this would be Dataset<Row>

public class DFBuilder {
    public DataFrame build() {
        ...
        return dataframe;
    }
}

I add this class to the PySpark/Jupyter classpath so it's callable via py4j. Now when I call it I get a strange type:

b = sc._jvm.companyX.DFBuilder()
print(type(b.build()))
#prints: py4j.java_gateway.JavaObject

vs.

print(type(sc.parallelize([]).toDF()))
#prints: pyspark.sql.dataframe.DataFrame

Is there a way to convert this JavaObject into a proper PySpark DataFrame? One problem is that when I call df.show() on a DataFrame built in Java, the output is printed in the Spark logs rather than in the notebook cell.
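
To illustrate, calling show() directly on the raw JavaObject goes through py4j, so the table ends up on the JVM's stdout (the Spark/driver logs) instead of the notebook output (jdf is just a name used here):

jdf = b.build()  # py4j.java_gateway.JavaObject
jdf.show()       # table is printed to the Spark logs, the notebook cell stays empty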

Piotr Reszke

3 Answers


You can use the DataFrame initializer:

from pyspark.sql import DataFrame, SparkSession

spark = SparkSession.builder.getOrCreate()

DataFrame(b.build(), spark)

If you use an outdated Spark version, replace the SparkSession instance with an SQLContext.
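
A minimal sketch of that legacy variant, assuming sc is the existing SparkContext and b is the builder from the question:

from pyspark.sql import DataFrame, SQLContext

sqlContext = SQLContext(sc)            # legacy entry point on old Spark versions
df = DataFrame(b.build(), sqlContext)  # wrap the py4j JavaObject
print(type(df))                        # pyspark.sql.dataframe.DataFrame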

Reference: Zeppelin: Scala Dataframe to python

Alper t. Turker


As of Spark 2.4 you should still use SQLContext rather than SparkSession when wrapping a Scala DataFrame in a Python one. Some relevant PySpark session code:

self._wrapped = SQLContext(self._sc, self, self._jwrapped)
...
# in methods returning DataFrame
return DataFrame(jdf, self._wrapped)

If a SparkSession is passed instead, some methods like toPandas() won't work on the resulting DataFrame.
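
A short sketch of that wrapping, assuming sc, spark, and the b builder from the question are already defined:

from pyspark.sql import DataFrame, SQLContext

wrapped = SQLContext(sc, sparkSession=spark)  # same shape as self._wrapped above
df = DataFrame(b.build(), wrapped)
pdf = df.toPandas()                           # works because the wrapper is an SQLContext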


For someone with a SparkSession object, even on a newer Spark (like 3.2):

from pyspark.sql import DataFrame, SparkSession, SQLContext

# SparkSession
spark = SparkSession.builder.master("local[*]") \
    .appName('sample') \
    .getOrCreate()

# py4j.java_gateway.JavaObject returned by the Java builder
javaObjectDf = spark._jvm.com.your.javaPackage.DfBuilder().build()

sqlContext = SQLContext(sparkContext=spark.sparkContext, sparkSession=spark)
df_from_java = DataFrame(javaObjectDf, sqlContext)

# python DataFrame
print(df_from_java)
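
With the wrapped DataFrame, actions like show() print in the notebook cell rather than in the Spark logs. A quick check, assuming the setup above:

df_from_java.show(5)        # output appears in the notebook cell
df_from_java.printSchema()  # schema of the Java-built DataFrame
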
Akash Sharma