I have a Java class registered in PySpark, and I'm trying to pass a Broadcast variable from PySpark to a method in this class, like so:

from py4j.java_gateway import java_import
java_import(spark.sparkContext._jvm, "net.a.b.c.MyClass")
myPythonGateway = spark.sparkContext._jvm.MyClass()

with open("tests/fixtures/file.txt", "rb") as binary_file:
    data = spark.sparkContext.broadcast(binary_file.read())
    myPythonGateway.setData(data)

But this is throwing:

AttributeError: 'Broadcast' object has no attribute '_get_object_id'

However, if I pass the byte[] directly, without wrapping it in broadcast(), it works fine. But I need this variable to be broadcast, as it will be used repeatedly.

Chris
Dexter

1 Answer

According to the py4j docs, the above error will be thrown if you try to pass a Python collection to a method that expects a Java collection. The docs give the following solution:

You can explicitly convert Python collections using one of the following converters located in the py4j.java_collections module: SetConverter, MapConverter, ListConverter.

An example is provided there also.

Presumably, this error occurs when py4j tries to convert the value attribute of the Broadcast object, so converting that value first may fix the problem (untested), e.g.:

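from py4j.java_collections import ListConverter

# binary_file is the open file handle from the question's "with" block
converted_data = ListConverter().convert(binary_file.read(), spark.sparkContext._jvm._gateway_client)
broadcast_data = spark.sparkContext.broadcast(converted_data)
myPythonGateway.setData(broadcast_data)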
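
To see where the `_get_object_id` message comes from, here is a rough, self-contained sketch of how argument encoding in Py4J appears to behave: simple values (numbers, strings, bytes) are serialized directly, and anything else is assumed to be a proxy for a JVM object and asked for its object id. This is a simplified model, not Py4J's actual code; `encode_argument`, `FakeJavaObject`, `FakeBroadcast` and `try_encode` are hypothetical names made up for illustration.

```python
# Simplified, hypothetical model of how Py4J encodes call arguments.
# This is NOT Py4J's real implementation; all names are illustrative.

def encode_argument(arg):
    """Return a toy wire representation for a single call argument."""
    if arg is None:
        return "null"
    if isinstance(arg, bool):
        return f"bool:{arg}"
    if isinstance(arg, (int, float)):
        return f"number:{arg}"
    if isinstance(arg, (bytes, bytearray)):
        return f"bytes:{len(arg)}"
    if isinstance(arg, str):
        return f"string:{arg}"
    # Anything else is assumed to be a proxy for an object living in the JVM.
    return f"ref:{arg._get_object_id()}"


class FakeJavaObject:
    """Stand-in for a Py4J proxy that points at a JVM object."""

    def __init__(self, object_id):
        self._object_id = object_id

    def _get_object_id(self):
        return self._object_id


class FakeBroadcast:
    """Stand-in for a PySpark Broadcast: a plain Python object, not a proxy."""

    def __init__(self, value):
        self.value = value


def try_encode(arg):
    """Encode an argument, returning the error text instead of raising."""
    try:
        return encode_argument(arg)
    except AttributeError as exc:
        return f"error: {exc}"


if __name__ == "__main__":
    print(try_encode(b"raw bytes"))                 # raw bytes are encoded directly
    print(try_encode(FakeJavaObject("o42")))        # JVM proxies are passed by id
    print(try_encode(FakeBroadcast(b"raw bytes")))  # plain Python wrapper fails
```

Under this model, the raw bytes go through while the wrapper object fails, which matches the behaviour described in the question. If that is what is happening, the Python-side Broadcast wrapper itself, rather than its value, is the argument being rejected.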

Chris